📊 Digest statistics

Total digests: 34022

Added today: 82

Last updated: today

📄 SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

2025-10-16

Authors:

Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu

Annotation:

Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: C...

ID: 2510.12784v1 cs.CV, cs.CL, I.4.0

arXiv PDF

📄 Task-Aware Resolution Optimization for Visual Large Language Models

2025-10-15

Authors:

Weiqing Luo, Zhen Tan, Yifan Li, Xinyu Zhao, Kwonjoon Lee, Behzad Dariush, Tianlong Chen

Annotation:

Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing a correlation between resolution preferences with image complexity, and uncertainty...

ID: 2510.09822v1 cs.CV, cs.CL

arXiv PDF

📄 The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

2025-10-14

Authors:

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

Annotation:

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location),...

ID: 2510.08482v2 cs.CV, cs.CL

arXiv PDF

📄 SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

2025-10-14

Authors:

Md Kowsher, Ali O. Polat, Ehsan Mohammady Ardehaly, Mehrdad Salehi, Zia Ghiasi, Prasanth Murali, Chen Chen

Annotation:

This paper presents a theoretical framework explaining why fine-tuning small, randomly selected subnetworks (slices) within pretrained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning-slice property arising from two phenomena: (1) spectral balance: the eigenspectra of different weight matrix slices are remarkably similar; and (2) high task energy: their backbone representations retain rich, task-relevant features. This leads to the ...

ID: 2510.08513v2 cs.CV, cs.CL

arXiv PDF

📄 Unleashing Perception-Time Scaling to Multimodal Reasoning Models

2025-10-14

Authors:

Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu

Annotation:

Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVL...

ID: 2510.08964v1 cs.CV, cs.CL

arXiv PDF

📄 The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

2025-10-11

Authors:

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

Annotation:

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, l...

ID: 2510.08482v1 cs.CV, cs.CL

arXiv PDF

📄 SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

2025-10-11

Authors:

Md Kowsher, Ali O. Polat, Ehsan Mohammady Ardehaly, Mehrdad Salehi, Zia Ghiasi, Prasanth Murali, Chen Chen

Annotation:

ID: 2510.08513v1 cs.CV, cs.CL

arXiv PDF

📄 Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models

2025-10-08

Authors:

Leander Girrbach, Stephan Alaniz, Genevieve Smith, Trevor Darrell, Zeynep Akata

Annotation:

Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations ar...

ID: 2510.03721v1 cs.CV, cs.CL, cs.CY, cs.LG

arXiv PDF

📄 No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

2025-10-08

Authors:

Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy

Annotation:

Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet, the distribution of biomedical captions from large-scale open source literature reveals that a huge portion of captions far exceed 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (thus, enabling additi...

ID: 2510.03978v1 cs.CV, cs.CL

arXiv PDF

📄 Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

2025-10-08

Authors:

Minh Hoang Nguyen, Su Nguyen Thiet

Annotation:

Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietna...

ID: 2510.04003v1 cs.CV, cs.CL, 68T50, 68T10, I.2.7; I.5; I.7.5

arXiv PDF

Showing 91-100 of 185 entries