AURA Score: A Metric For Holistic Audio Question Answering Evaluation
2510.04934v1
eess.AS, cs.AI
2025-10-08
Authors:
Satvik Dixit, Soham Deshmukh, Bhiksha Raj
Abstract
Audio Question Answering (AQA) is a key task for evaluating Audio-Language
Models (ALMs), yet assessing open-ended responses remains challenging. Existing
metrics used for AQA such as BLEU, METEOR and BERTScore, mostly adapted from
NLP and audio captioning, rely on surface similarity and fail to account for
question context, reasoning, and partial correctness. To address this gap, we
make three contributions. First, we introduce
AQEval to enable systematic benchmarking of AQA metrics. It is the first
benchmark of its kind, consisting of 10k model responses annotated by multiple
humans for their correctness and relevance. Second, we conduct a comprehensive
analysis of existing AQA metrics on AQEval, highlighting weak correlation with
human judgment, especially for longer answers. Third, we propose a new metric,
the AURA score, to better evaluate open-ended model responses. On AQEval, AURA
achieves state-of-the-art correlation with human ratings, significantly
outperforming all baselines. Through this work, we aim to highlight the
limitations of current AQA evaluation methods and motivate better metrics. We
release both the AQEval benchmark and the AURA metric to support future
research in holistic AQA evaluation.