Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

2510.08078v2 cs.SD, cs.LG 2025-10-14

Авторы:

Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang

Abstract

Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50\% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regressio...

Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation

Differentiable Attenuation Filters for Feedback Delay Networks

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

Count The Notes: Histogram-Based Supervision for Automatic Music Transcription

Навигация