Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses
2510.13281v1
eess.AS, cs.CL, cs.LG
2025-10-17
Авторы:
Sungnyun Kim, Kangwook Jang, Sungwoo Cho, Joon Son Chung, Hoirin Kim, Se-Young Yun
Abstract
This paper introduces a new paradigm for generative error correction (GER)
framework in audio-visual speech recognition (AVSR) that reasons over
modality-specific evidences directly in the language space. Our framework,
DualHyp, empowers a large language model (LLM) to compose independent N-best
hypotheses from separate automatic speech recognition (ASR) and visual speech
recognition (VSR) models. To maximize the effectiveness of DualHyp, we further
introduce RelPrompt, a noise-aware guidance mechanism that provides
modality-grounded prompts to the LLM. RelPrompt offers the temporal reliability
of each modality stream, guiding the model to dynamically switch its focus
between ASR and VSR hypotheses for an accurate correction. Under various
corruption scenarios, our framework attains up to 57.7% error rate gain on the
LRS2 benchmark over standard ASR baseline, contrary to single-stream GER
approaches that achieve only 10% gain. To facilitate research within our
DualHyp framework, we release the code and the dataset comprising ASR and VSR
hypotheses at https://github.com/sungnyun/dualhyp.
Ссылки и действия
Дополнительные ресурсы: