Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

2510.03115v1 cs.CL, cs.SD 2025-10-07

Авторы:

Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet

Abstract

Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

A new kid on the block: Distributional semantics predicts the word-specific tone...

CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokk...

POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Tex...

CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Навигация