📄 FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

2025-10-01

Authors:

Tanawan Premsri, Parisa Kordjamshidi


Annotation:

Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans utilize to comprehend and describe space. With the rapid progress in Multimodal Language models, the moment has come to integrate this long-overlooked dimension into these models. In particular, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are provided from perspectives other than the camera. To address this limitation, we propose F...

ID: 2509.23452v1 cs.CV, cs.CL

arXiv PDF

📄 Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional

2025-10-01

Authors:

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra


Annotation:

Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visu...

ID: 2509.23499v1 cs.CV, cs.CL, cs.LG

arXiv PDF

📄 HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection

2025-10-01

Authors:

Siyuan Gao, Jiashu Yao, Haoyu Wen, Yuhang Guo, Zeming Liu, Heyan Huang


Annotation:

Embodied agents can identify and report safety hazards in home environments. Accurately evaluating their capabilities in home safety inspection tasks is crucial, but existing benchmarks suffer from two key limitations. First, they oversimplify safety inspection tasks by using textual descriptions of the environment instead of direct visual information, which hinders the accurate evaluation of embodied agents based on Vision-Language Models (VLMs). Second, they use a single, static viewpoint ...

ID: 2509.23690v1 cs.CV, cs.CL

arXiv PDF

📄 Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding

2025-10-01

Authors:

Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai


Annotation:

Grounding natural language queries in graphical user interfaces (GUIs) presents a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths o...

ID: 2509.24133v1 cs.CV, cs.CL

arXiv PDF

📄 Latent Visual Reasoning

2025-10-01

Authors:

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu


Annotation:

Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Late...

ID: 2509.24251v1 cs.CV, cs.CL

arXiv PDF

📄 Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

2025-10-01

Authors:

Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin, Huishuai Zhang, Dongyan Zhao


Annotation:

The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Que...

ID: 2509.24445v1 cs.CV, cs.CL

arXiv PDF

📄 NeMo: Needle in a Montage for Video-Language Understanding

2025-10-01

Authors:

Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang


Annotation:

Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated dat...

ID: 2509.24563v1 cs.CV, cs.CL

arXiv PDF

📄 MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment

2025-10-01

Authors:

Fankai Jia, Daisong Gan, Zhe Zhang, Zhaochi Wen, Chenchen Dan, Dong Liang, Haifeng Wang


Annotation:

Magnetic resonance imaging (MRI) quality assessment is crucial for clinical decision-making, yet remains challenging due to data scarcity and protocol variability. Traditional approaches face fundamental trade-offs: signal-based methods like MRIQC provide quantitative metrics but lack semantic understanding, while deep learning approaches achieve high accuracy but sacrifice interpretability. To address these limitations, we introduce the Multimodal MRI Quality Assessment (MMRQA) framework, pione...

ID: 2509.24888v1 cs.CV, cs.CL

arXiv PDF

📄 TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

2025-10-01

Authors:

Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng


Annotation:

Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions betwee...

ID: 2509.25143v1 cs.CV, cs.CL

arXiv PDF

📄 VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

2025-09-30

Authors:

Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj

## Context

Evaluating the quality of video understanding remains difficult. Metrics such as BLEU, ROUGE, and BERTScore are commonly used, but they cannot accurately reflect the subtleties of human judgment. Manual evaluations, although reliable, require considerable time and resources. Recent studies have explored using large language models (LLMs) or multimodal language models (MLLMs) to automate this task. However, their application to video understanding remains relatively unexplored. We propose VideoJudge, 3B- and 7B-sized MLLMs optimized for evaluating the outputs of video understanding models in the form of video-grounded text responses. We propose a new training methodology for VideoJudge that uses the interplay between a generator and an evaluator to ensure accurate and appropriate results.

## Method

We developed VideoJudge on the basis of reinforcement learning. Our model is divided into two parts: a generator, which produces responses to videos, and an evaluator, which uses a multimodal model (MLLM) to rate those responses accurately. Responses that do not match the target rating are discarded. We used 3B and 7B parameters for VideoJudge to balance accuracy and efficiency. For training, we used a broad set of video tasks, including video detection, video quality, and video understanding. The model was evaluated on several metrics, including BLEU, ROUGE, and BERTScore, as well as on user quality ratings.

## Results

We conducted numerous experiments comparing VideoJudge with other MLLMs such as Qwen2.5-VL. We tested the model on three meta-evaluation benchmarks for video understanding. VideoJudge-7B showed significant advantages over larger models such as Qwen2.5-VL (32B and 72B). We also found that chains of thought under randomized training bring no additional gain, which confirms the importance of video input as a key factor for accurately evaluating video understanding models.

## Significance

We see broad application areas for VideoJudge in video understanding, such as video detection, video captioning, and video classification. The model has the potential to enable efficient and accurate systems that can evaluate the outputs of video understanding models without manual intervention. This offers significant advantages in the speed and cost of the evaluation process. We also note that our approach can be ...

Annotation:

Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B and 7B-sized MLLM judge specialized to...

ID: 2509.21451v1 cs.CV, cs.CL

arXiv PDF
