What You See is What You Ask: Evaluating Audio Descriptions
2510.00808v1
cs.CV, cs.AI, cs.CL
2025-10-04
Authors:
Divy Kala, Eshika Khandelwal, Makarand Tapaswi
Abstract
Audio descriptions (ADs) narrate important visual details in movies, enabling
Blind and Low Vision (BLV) users to understand narratives and appreciate visual
details. Existing works in automatic AD generation mostly focus on few-second
trimmed clips, and evaluate them by comparing against a single ground-truth
reference AD. However, writing ADs is inherently subjective. Through alignment
and analysis of two independent AD tracks for the same movies, we quantify the
subjectivity in when and whether to describe, and what and how to highlight.
Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a
QA benchmark that evaluates ADs at the level of few-minute-long, coherent video
segments, testing whether they would help BLV users understand the story and
appreciate visual details. ADQA features visual appreciation (VA) questions
about visual facts and narrative understanding (NU) questions based on the
plot. Through ADQA, we show that current AD generation methods lag far behind
human-authored ADs. We conclude with several recommendations for future work
and introduce a public leaderboard for benchmarking.