LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
2509.25670v1
cs.SD, cs.CV
2025-10-02
Authors:
Kang Yang, Yifan Liang, Fangkun Liu, Zhenping Xie, Chengshi Zheng
Abstract
Lip-to-speech (L2S) synthesis for Mandarin remains challenging because of
complex viseme-to-phoneme mappings and the critical role of lexical
tones in intelligibility. To address these challenges, we propose Lexical Tone-Aware
Lip-to-Speech (LTA-L2S). To tackle viseme-to-phoneme complexity, our model
adapts an English pre-trained audio-visual self-supervised learning (SSL) model
via a cross-lingual transfer learning strategy. This strategy not only
transfers universal knowledge learned from extensive English data to the
Mandarin domain but also circumvents the prohibitive cost of training such a
model from scratch. To specifically model lexical tones and enhance
intelligibility, we further employ a flow-matching model to generate the F0
contour. This generation process is guided by ASR-fine-tuned SSL speech units,
which contain crucial suprasegmental information. The overall speech quality is
then elevated through a two-stage training paradigm, where a flow-matching
postnet refines the coarse spectrogram from the first stage. Extensive
experiments demonstrate that LTA-L2S significantly outperforms existing methods
in both speech intelligibility and tonal accuracy.