More than a Moment: Towards Coherent Sequences of Audio Descriptions
2510.25440v1
cs.CV, cs.CL
2025-10-31
Authors:
Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
Abstract
Audio Descriptions (ADs) convey essential on-screen information, allowing
visually impaired audiences to follow videos. To be effective, ADs must form a
coherent sequence that helps listeners to visualise the unfolding scene, rather
than describing isolated moments. However, most automatic methods generate each
AD independently, often resulting in repetitive, incoherent descriptions. To
address this, we propose a training-free method, CoherentAD, that first
generates multiple candidate descriptions for each AD time interval, and then
performs auto-regressive selection across the sequence to form a coherent and
informative narrative. To evaluate AD sequences holistically, we introduce a
sequence-level metric, StoryRecall, which measures how well the predicted ADs
convey the ground truth narrative, alongside repetition metrics that capture
the redundancy across consecutive AD outputs. Our method produces coherent AD
sequences with enhanced narrative understanding, outperforming prior approaches
that rely on independent generations.
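
The two-stage idea in the abstract (several candidates per AD interval, then auto-regressive selection) can be illustrated with a minimal sketch. The abstract does not say how candidates are scored or how repetition is measured, so everything below is an assumption made for illustration: `novelty_score` is a hypothetical word-overlap scorer standing in for whatever selector CoherentAD actually uses, and `consecutive_repetition` is a simple lexical proxy, not the paper's repetition metric or StoryRecall.

```python
import re
from typing import Callable, List, Sequence, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a semantic representation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two descriptions, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def novelty_score(candidate: str, history: Sequence[str]) -> float:
    """Hypothetical scorer (an assumption, not the paper's): higher when the
    candidate repeats less of the already selected descriptions."""
    if not history:
        return 1.0
    return 1.0 - max(jaccard(candidate, prev) for prev in history)


def select_sequence(
    candidates_per_interval: Sequence[Sequence[str]],
    score: Callable[[str, Sequence[str]], float] = novelty_score,
) -> List[str]:
    """Pick one candidate per AD interval, in temporal order, conditioning
    each choice on the descriptions selected for earlier intervals."""
    selected: List[str] = []
    for candidates in candidates_per_interval:
        if not candidates:
            raise ValueError("each interval needs at least one candidate")
        # Ties are resolved in favour of the earliest candidate.
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)
    return selected


def consecutive_repetition(sequence: Sequence[str]) -> float:
    """Mean word overlap between neighbouring ADs (lower means less redundant).
    A lexical proxy only, assumed here for illustration."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy candidates for three consecutive AD intervals (invented examples).
    candidates = [
        ["A man enters the kitchen.", "A man walks in."],
        ["A man enters the kitchen.", "He pours a cup of coffee."],
        ["He pours a cup of coffee.", "He sits by the window."],
    ]
    independent = [c[0] for c in candidates]  # each interval chosen in isolation
    coherent = select_sequence(candidates)
    print(coherent)
    print(consecutive_repetition(independent), consecutive_repetition(coherent))
```

In this toy setting, choosing each interval in isolation repeats the same sentence twice, while conditioning on earlier selections avoids it. A real system would presumably replace the lexical scorer with a model that also judges informativeness and narrative flow.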