U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
2510.16718v1
cs.SD, cs.CL, cs.LG
2025-10-22
Авторы:
Xusheng Yang, Long Zhou, Wenfu Wang, Kai Hu, Shulin Feng, Chenxing Li, Meng Yu, Dong Yu, Yuexian Zou
Abstract
We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech
\textbf{Codec} that achieves high-fidelity reconstruction and fast speech
generation at an extremely low frame-rate of 5Hz (5 frames per second). Extreme
compression at 5Hz typically leads to severe intelligibility and spectral
detail loss, we introduce a Transformer-based inter-frame long-term dependency
module and systematically explore residual vector quantization (RVQ) depth and
codebook size to identify optimal configurations. Moreover, we apply U-Codec
into a large language model (LLM)-based auto-regressive TTS model, which
leverages global and local hierarchical architecture to effectively capture
dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer
RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that
U-Codec improves LLM-based TTS inference speed by around 3 $\times$ over
high-frame-rate codecs while maintaining similarity and naturalness. These
results validate the feasibility of using highly compressed 5Hz discrete tokens
for fast and high-fidelity speech synthesis.
Ссылки и действия
Дополнительные ресурсы: