DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
2510.09016v1
cs.SD, cs.AI, eess.AS
2025-10-14
Authors:
Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
Abstract
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates
strong expressiveness but remains limited by data scarcity and model
scalability. We introduce a two-stage pipeline: a compact seed set of
human-sung recordings is constructed by pairing fixed melodies with diverse
LLM-generated lyrics, and melody-specific models are trained to synthesize over
500 hours of high-quality Chinese singing data. Building on this corpus, we
propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm,
systematically scaled in depth, width, and resolution for enhanced fidelity.
Furthermore, we design an implicit alignment mechanism that obviates
phoneme-level duration labels by constraining phoneme-to-acoustic attention
within character-level spans, thereby improving robustness under noisy or
uncertain alignments. Extensive experiments validate that our approach enables
scalable, alignment-free, and high-fidelity SVS.
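
The abstract does not spell out how attention is constrained to character-level spans. The following is a minimal sketch of one possible reading, not the authors' implementation. It assumes acoustic frames act as queries and phonemes as keys, that each character has a known half-open frame interval, and that each phoneme belongs to exactly one character. The names `build_span_mask`, `char_spans`, and `phoneme_to_char` are hypothetical.

```python
def build_span_mask(char_spans, phoneme_to_char, n_frames):
    """Return mask[t][p] == True when acoustic frame t may attend to phoneme p.

    char_spans: list of (start, end) half-open frame intervals, one per character (assumed input).
    phoneme_to_char: for each phoneme, the index of the character it belongs to (assumed input).
    """
    mask = [[False] * len(phoneme_to_char) for _ in range(n_frames)]
    for p, c in enumerate(phoneme_to_char):
        start, end = char_spans[c]
        # Clip the span to the valid frame range before marking it.
        for t in range(max(start, 0), min(end, n_frames)):
            mask[t][p] = True
    return mask


# Toy example: two characters covering frames [0, 3) and [3, 6),
# each made of two phonemes.
char_spans = [(0, 3), (3, 6)]
phoneme_to_char = [0, 0, 1, 1]
mask = build_span_mask(char_spans, phoneme_to_char, n_frames=6)
```

In this sketch, every frame inside a character's span may attend to any phoneme of that character, so no phoneme-level duration label is needed. How attention is divided among the phonemes within a span is left to the model.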