Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

2510.16926v1 cs.CV, cs.CL 2025-10-22

Авторы:

Chenxu Li, Zhicai Wang, Yuan Sheng, Xingyu Zhu, Yanbin Hao, Xiang Wang

Abstract

Multimodal Large Language Models (MLLMs) increasingly support dynamic image resolutions. However, current evaluation paradigms primarily assess semantic performance, overlooking the critical question of resolution robustness - whether performance remains stable across varying input resolutions. To address this gap, we introduce \textbf{Res-Bench}, a comprehensive benchmark comprising 14,400 samples across 12 resolution levels and six core capability dimensions. We designed a novel evaluation framework that goes beyond traditional accuracy metrics to capture performance stability. This framework introduces multiple robustness metrics: Spearman's correlation for assessing resolution-performance trends, and Absolute/Relative Continuous Error (ACE/RCE) for measuring performance volatility. Using these metrics, we conducted a large-scale evaluation of leading MLLMs. Our analysis encompasses: (1) model-centric and task-centric robustness examination, (2) investigation of preprocessing strategies including padding and super-resolution, and (3) exploration of fine-tuning for stability enhancement.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Res-Bench: Benchmarking the Robustness of Multimodal Large Language Models to Dynamic Resolution Input

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Text-Only Training for Image Captioning with Retrieval Augmentation and Modality...

Generalized Medical Phrase Grounding

CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on...

Thinking with Programming Vision: Towards a Unified View for Thinking with Image...

See, Think, Learn: A Self-Taught Multimodal Reasoner

Навигация