Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
2510.08078v2
cs.SD, cs.LG
2025-10-14
Authors:
Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Abstract
Video-to-Audio generation has made remarkable strides in automatically
synthesizing sound for video. However, existing evaluation metrics, which focus
on semantic and temporal alignment, overlook a critical failure mode: models
often generate acoustic events, particularly speech and music, that have no
corresponding visual source. We term this phenomenon Insertion Hallucination (IH)
and identify it as a systemic risk driven by dataset biases, such as the
prevalence of off-screen sounds, that remains completely undetected by current
metrics. To address this challenge, we first develop a systematic evaluation
framework that employs a majority-voting ensemble of multiple audio event
detectors. We also introduce two novel metrics to quantify the prevalence and
severity of this issue: IH@vid (the fraction of videos with hallucinations) and
IH@dur (the fraction of hallucinated duration). Building on this, we propose
Posterior Feature Correction (PFC), a novel training-free inference-time method that
mitigates IH. PFC operates in a two-pass process: it first generates an initial
audio output to detect hallucinated segments, and then regenerates the audio
after masking the corresponding video features at those timestamps. Experiments
on several mainstream V2A benchmarks first reveal that state-of-the-art models
suffer from severe IH. In contrast, our PFC method reduces both the prevalence
and duration of hallucinations by over 50% on average, without degrading, and
in some cases even improving, conventional metrics for audio quality and
temporal synchronization. Our work is the first to formally define,
systematically measure, and effectively mitigate Insertion Hallucination,
paving the way for more reliable and faithful V2A models.
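
The detection and measurement steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything below is an assumption made for illustration: the function names (`majority_vote`, `ih_at_vid`, `ih_at_dur`, `mask_features`), the frame-wise boolean representation of detector outputs, the strict-majority rule, the pooling of IH@dur over all frames of all videos (rather than averaging per video), and the use of zeroing as the masking operation. It also assumes that any speech or music flagged by a detector has no visible source in the video, which is what makes a flagged frame count as a hallucination.

```python
from typing import List, Sequence


def majority_vote(detector_flags: Sequence[Sequence[bool]]) -> List[bool]:
    """Frame-wise strict majority vote over per-detector flags (assumed rule).

    detector_flags[d][t] is True when detector d reports speech or music
    in frame t of the generated audio.
    """
    n_detectors = len(detector_flags)
    n_frames = len(detector_flags[0])
    if any(len(flags) != n_frames for flags in detector_flags):
        raise ValueError("all detectors must cover the same number of frames")
    return [
        2 * sum(flags[t] for flags in detector_flags) > n_detectors
        for t in range(n_frames)
    ]


def ih_at_vid(per_video_flags: Sequence[Sequence[bool]]) -> float:
    """Fraction of videos with at least one hallucinated frame."""
    return sum(any(flags) for flags in per_video_flags) / len(per_video_flags)


def ih_at_dur(per_video_flags: Sequence[Sequence[bool]]) -> float:
    """Fraction of hallucinated frames, pooled over all videos (assumption)."""
    total_frames = sum(len(flags) for flags in per_video_flags)
    hallucinated = sum(sum(flags) for flags in per_video_flags)
    return hallucinated / total_frames


def mask_features(
    video_features: Sequence[Sequence[float]], flags: Sequence[bool]
) -> List[List[float]]:
    """Zero the video feature vector at each flagged frame (assumed masking).

    In a two-pass scheme, the masked features would be passed back to the
    V2A model for the second generation pass.
    """
    return [
        [0.0] * len(vec) if flagged else list(vec)
        for vec, flagged in zip(video_features, flags)
    ]


# Toy example: three hypothetical detectors, two four-frame videos.
video_a = majority_vote([
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
])
video_b = majority_vote([[False] * 4, [False] * 4, [False] * 4])
features_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked_a = mask_features(features_a, video_a)
```

In this toy setup only the first two frames of `video_a` win the vote, so one of two videos and two of eight frames count as hallucinated. A real pipeline would also need to align detector time resolution with the video feature frame rate, which this sketch glosses over by assuming one flag per feature frame.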