RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data

2511.20974v1 eess.AS, cs.CL, cs.LG 2025-11-27

Authors:

Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath

Abstract

The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, which are relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
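The abstract reports ASR-BLEU scores and relative gains but does not state the baseline scores they are measured against. The sketch below shows what the quoted figures imply, assuming that "relative gain" means (new - old) / old. The function names are illustrative, and the printed values are upper bounds derived from the abstract's numbers, not scores reported by the authors.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old` (assumed definition)."""
    return (new - old) / old


def implied_baseline_ceiling(score: float, gain: float) -> float:
    """Largest baseline for which `score` is at least `gain` above it.

    `gain` is a fraction, e.g. 0.27 for 27%.
    """
    return score / (1.0 + gain)


# Figures quoted in the abstract (CVSS-C test set, ASR-BLEU).
de_en_ceiling = implied_baseline_ceiling(25.17, 0.27)
es_en_ceiling = implied_baseline_ceiling(29.86, 0.14)

# Because the gains are stated as "over" 27% and 14%, the true
# baselines must lie below these values.
print(f"DE->EN baseline below {de_en_ceiling:.2f}")
print(f"ES->EN baseline below {es_en_ceiling:.2f}")
```

Under that assumed definition, the strongest prior system would score below roughly 19.82 ASR-BLEU on German-to-English and below roughly 26.19 on Spanish-to-English.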
