MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
2510.13276v1
cs.CV, cs.CL
2025-10-17
Authors:
Keyan Zhou, Zecheng Tang, Lingfeng Ming, Guanghao Zhou, Qiguang Chen, Dan Qiao, Zheming Yang, Libo Qin, Minghui Qiu, Juntao Li, Min Zhang
Abstract
The rapid advancement of large vision-language models (LVLMs) has led to a
significant expansion of their context windows. However, an extended context
window does not guarantee the effective utilization of the context, posing a
critical challenge for real-world applications. Current evaluations of such
long-context faithfulness are predominantly focused on the text-only domain,
while multimodal assessments remain limited to short contexts. To bridge this
gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate
the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8
distinct tasks spanning 6 context length intervals and incorporates diverse
modalities, including text, images, and videos. Our evaluation of
state-of-the-art LVLMs reveals their limited faithfulness in handling long
multimodal contexts. Furthermore, we provide an in-depth analysis of how
context length and the position of crucial content affect the faithfulness of
these models.