Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation
2510.03115v1
cs.CL, cs.SD
2025-10-07
Авторы:
Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet
Abstract
Speech-to-Text Translation (S2TT) systems built from Automatic Speech
Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major
limitations: error propagation and the inability to exploit prosodic or other
acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced,
with the expectation that jointly accessing speech and transcription will
overcome these issues. Analyzing CoT through attribution methods, robustness
evaluations with corrupted transcripts, and prosody-awareness, we find that it
largely mirrors cascaded behavior, relying mainly on transcripts while barely
leveraging speech. Simple training interventions, such as adding Direct S2TT
data or noisy transcript injection, enhance robustness and increase speech
attribution. These findings challenge the assumed advantages of CoT and
highlight the need for architectures that explicitly integrate acoustic
information into translation.
Ссылки и действия
Дополнительные ресурсы: