Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional

2509.23499v1 cs.CV, cs.CL, cs.LG 2025-10-01

Authors:

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra

Abstract

Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of these dependencies and the interaction between them within current benchmark evaluations remain poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs) covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.
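
The abstract does not describe the measurement procedure, so the following is only an illustrative sketch of one way such dependencies could be summarized. It is not the authors' method. It assumes each benchmark item has been scored three times (with both inputs, with the question withheld, and with the image withheld) and that a chance-level accuracy is known. The names `Outcome`, `dependency_profile`, and the three "gain" quantities are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Outcome:
    """Correctness of one benchmark item under three input conditions (assumed setup)."""
    full: bool        # model saw image + question
    image_only: bool  # question withheld
    text_only: bool   # image withheld


def dependency_profile(outcomes: Iterable[Outcome], chance: float) -> Dict[str, float]:
    """Summarize how much accuracy each input condition contributes.

    image_gain / text_gain: accuracy above chance from a single modality
    (a rough proxy for intra-modality dependence).
    interaction_gain: accuracy with both inputs beyond the better single
    modality (a rough proxy for inter-modality dependence).
    """
    items = list(outcomes)
    if not items:
        raise ValueError("need at least one outcome")
    n = len(items)
    acc_full = sum(o.full for o in items) / n
    acc_image = sum(o.image_only for o in items) / n
    acc_text = sum(o.text_only for o in items) / n
    return {
        "full": acc_full,
        "image_only": acc_image,
        "text_only": acc_text,
        "image_gain": acc_image - chance,
        "text_gain": acc_text - chance,
        "interaction_gain": acc_full - max(acc_image, acc_text),
    }


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = [
        Outcome(full=True, image_only=True, text_only=False),
        Outcome(full=True, image_only=False, text_only=False),
        Outcome(full=True, image_only=True, text_only=True),
        Outcome(full=False, image_only=False, text_only=False),
    ]
    for name, value in dependency_profile(toy, chance=0.25).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, a benchmark with a large `image_gain` and a small `interaction_gain` would be one where high accuracy says little about multi-modal reasoning, which is the pattern the abstract reports for some benchmarks designed to reduce text-only bias.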
