ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning
2510.05984v1
cs.SD, cs.AI, eess.AS
2025-10-09
Authors:
Tao Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng
Abstract
Diffusion models have demonstrated remarkable performance in speech
synthesis, but typically require multi-step sampling, resulting in low
inference efficiency. Recent studies address this issue by distilling diffusion
models into consistency models, enabling efficient one-step generation.
However, these approaches introduce additional training costs and rely heavily
on the performance of pre-trained teacher models. In this paper, we propose
ECTSpeech, a simple and effective one-step speech synthesis framework that, for
the first time, incorporates the Easy Consistency Tuning (ECT) strategy into
speech synthesis. By progressively tightening consistency constraints on a
pre-trained diffusion model, ECTSpeech achieves high-quality one-step
generation while significantly reducing training complexity. In addition, we
design a multi-scale gate module (MSGate) to enhance the denoiser's ability to
fuse features at different scales. Experimental results on the LJSpeech dataset
demonstrate that ECTSpeech achieves audio quality comparable to
state-of-the-art methods under single-step sampling, while substantially
reducing the model's training cost and complexity.