DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

2510.12210v2 eess.AS, cs.CL, cs.LG 2025-10-17

Авторы:

Yakun Song, Xiaobin Zhuang, Jiawei Chen, Zhikang Niu, Guanrou Yang, Chenpeng Du, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Abstract

Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

Авторы:

Abstract

Ссылки и действия

Связанные статьи

RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data

Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hy...

DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech...

DarkStream: real-time speech anonymization with low latency

Навигация