VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
2510.12750v1
cs.CV, cs.AI, cs.LG
2025-10-16
Authors:
A. Alfarano, L. Venturoli, D. Negueruela del Castillo
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant
capabilities in joint visual and linguistic tasks. However, existing Visual
Question Answering (VQA) benchmarks often fail to evaluate deep semantic
understanding, particularly in complex domains like visual art analysis.
Confined to simple syntactic structures and surface-level attributes, their
questions fail to capture the diversity and depth of human visual inquiry. This
limitation incentivizes models to exploit statistical shortcuts rather than
engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a
new, large-scale VQA benchmark for the cultural heritage domain. This benchmark
is constructed using a novel multi-agent pipeline where specialized agents
collaborate to generate nuanced, validated, and linguistically diverse
questions. The resulting benchmark is structured along relevant visual
understanding dimensions that probe a model's ability to interpret symbolic
meaning, narratives, and complex visual relationships. Our evaluation of 14
state-of-the-art MLLMs on this benchmark reveals significant limitations in
current models, including a surprising weakness in simple counting tasks and a
clear performance gap between proprietary and open-source models.