VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
2509.25818v1
cs.CV, cs.AI, cs.CL
2025-10-02
Authors:
Kazuki Matsuda, Yuiga Wada, Shinnosuke Hirano, Seitaro Otsuki, Komei Sugiura
Abstract
In this study, we focus on the automatic evaluation of long and detailed
image captions generated by multimodal Large Language Models (MLLMs). Most
existing automatic evaluation metrics for image captioning are
designed for short captions and are not suitable for evaluating long captions.
Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to
their reliance on autoregressive inference and early fusion of visual
information. To address these limitations, we propose VELA, an automatic
evaluation metric for long captions developed within a novel
LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a
benchmark specifically designed for evaluating metrics for long captions. This
benchmark comprises 7,805 images, the corresponding human-provided long
reference captions and long candidate captions, and 32,246 human judgments from
three distinct perspectives: Descriptiveness, Relevance, and Fluency. We
demonstrated that VELA outperformed existing metrics and achieved superhuman
performance on LongCap-Arena.