📊 Статистика дайджестов

Всего дайджестов: 34022 Добавлено сегодня: 0

Последнее обновление: сегодня

📄 GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

2025-10-02

Авторы:

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam, Soheil Feizi, Naveed Akhtar

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generatin...

ID: 2509.25178v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

2025-10-02

Авторы:

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans o...

ID: 2509.25339v2 cs.CV, cs.AI, cs.LG, eess.IV

arXiv PDF

📄 Hybrid Approach for Enhancing Lesion Segmentation in Fundus Images

2025-10-02

Авторы:

Mohammadmahdi Eshragh, Emad A. Mohammed, Behrouz Far, Ezekiel Weis, Carol L Shields, Sandor R Ferenczy, Trafford Crump

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Choroidal nevi are common benign pigmented lesions in the eye, with a small risk of transforming into melanoma. Early detection is critical to improving survival rates, but misdiagnosis or delayed diagnosis can lead to poor outcomes. Despite advancements in AI-based image analysis, diagnosing choroidal nevi in colour fundus images remains challenging, particularly for clinicians without specialized expertise. Existing datasets often suffer from low resolution and inconsistent labelling, limiting...

ID: 2509.25549v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 YOLO-Based Defect Detection for Metal Sheets

2025-10-02

Авторы:

Po-Heng Chou, Chun-Chi Wang, Wei-Lung Mao

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

In this paper, we propose a YOLO-based deep learning (DL) model for automatic defect detection to solve the time-consuming and labor-intensive tasks in industrial manufacturing. In our experiments, the images of metal sheets are used as the dataset for training the YOLO model to detect the defects on the surfaces and in the holes of metal sheets. However, the lack of metal sheet images significantly degrades the performance of detection accuracy. To address this issue, the ConSinGAN is used to g...

ID: 2509.25659v1 cs.CV, cs.AI, cs.LG, eess.IV, eess.SP, 68T45, 68T07, I.2.10; I.4.7; I.5.4

arXiv PDF

📄 SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

2025-10-02

Авторы:

Christoph Timmermann, Hyunse Lee, Woojin Lee

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by c...

ID: 2509.26036v2 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 EntroPE: Entropy-Guided Dynamic Patch Encoder for Time Series Forecasting

2025-10-02

Авторы:

Sachith Abeywickrama, Emadeldeen Eldele, Min Wu, Xiaoli Li, Chau Yuen

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Transformer-based models have significantly advanced time series forecasting, with patch-based input strategies offering efficiency and improved long-horizon modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. In response, we propose En...

ID: 2509.26157v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 AttriGen: Automated Multi-Attribute Annotation for Blood Cell Datasets

2025-10-02

Авторы:

Walid Houmaidi, Youssef Sabiri, Fatima Zahra Iguenfer, Amine Abouaomar

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We introduce AttriGen, a novel framework for automated, fine-grained multi-attribute annotation in computer vision, with a particular focus on cell microscopy where multi-attribute classification remains underrepresented compared to traditional cell type categorization. Using two complementary datasets: the Peripheral Blood Cell (PBC) dataset containing eight distinct cell types and the WBC Attribute Dataset (WBCAtt) that contains their corresponding 11 morphological attributes, we propose a dua...

ID: 2509.26185v1 cs.CV, cs.AI, cs.LG, 60G35, 62M10, 62P35, 65C20, 68T45, 68U10, 92C35, 92C40, 92C42, 93E10, I.4; I.4.8; I.4.9; I.4.10; I.2; I.2.6; I.2.10; J.3; C.2.4; C.3; H.2.8; H.3.4; H.3.5; I.2.4; I.5; I.5.1; I.5.4; K.6.1

arXiv PDF

📄 Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification

2025-10-02

Авторы:

Artur Barros, Carlos Caetano, João Macedo, Jefersson A. dos Santos, Sandra Avila

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Indoor scene classification is a critical task in computer vision, with wide-ranging applications that go from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instea...

ID: 2509.26457v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

2025-10-02

Авторы:

Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diff...

ID: 2509.26644v1 cs.CV, cs.AI, cs.LG

arXiv PDF

📄 Convolutional Set Transformer

2025-10-01

Авторы:

Federico Chinello, Giacomo Boracchi

Саммари на русском не найдено
Доступные поля: ['id', 'arxiv_id', 'title', 'authors', 'abstract', 'summary_ru', 'categories', 'published_date', 'created_at']

Annotation:

We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddi...

ID: 2509.22889v1 cs.CV, cs.AI, cs.LG

arXiv PDF

Показано 211 - 220 из 358 записей