Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
2509.23499v1
cs.CV, cs.CL, cs.LG
2025-10-01
Authors:
Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra
Abstract
Understanding the interplay between intra-modality dependencies (the
contribution of an individual modality to a target task) and inter-modality
dependencies (the relationships between modalities and the target task) is
fundamental to advancing multi-modal learning. However, the nature of and
interaction between these dependencies within current benchmark evaluations
remain poorly characterized. In this work, we present a large-scale empirical
study to quantify these dependencies across 23 visual question-answering
benchmarks using multi-modal large language models (MLLMs) covering domains
such as general and expert knowledge reasoning, optical character recognition,
and document understanding. Our findings show that the reliance on vision,
question (text), and their interaction varies significantly, both across and
within benchmarks. We discover that numerous benchmarks intended to mitigate
text-only biases have inadvertently amplified image-only dependencies. This
characterization persists across model sizes, as larger models often use these
intra-modality dependencies to achieve high performance that masks an underlying
lack of multi-modal reasoning. We provide a quantitative characterization of
multi-modal datasets, enabling a principled approach to multi-modal benchmark
design and evaluation.