Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
2510.04577v1
cs.SD, cs.LG, cs.MM, eess.AS
2025-10-08
Авторы:
Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang
Abstract
While language models (LMs) paired with residual vector quantization (RVQ)
tokenizers have shown promise in text-to-audio (T2A) generation, they still lag
behind diffusion-based models by a non-trivial margin. We identify a critical
dilemma underpinning this gap: incorporating more RVQ layers improves audio
reconstruction fidelity but exceeds the generation capacity of conventional
LMs. To address this, we first analyze RVQ dynamics and uncover two key
limitations: 1) orthogonality of features across RVQ layers hinders effective
LMs training, and 2) descending semantic richness in tokens from deeper RVQ
layers exacerbates exposure bias during autoregressive decoding. Based on these
insights, we propose Siren, a novel LM-based framework that employs multiple
isolated transformers with causal conditioning and anti-causal alignment via
reinforcement learning. Extensive experiments demonstrate that Siren
outperforms both existing LM-based and diffusion-based T2A systems, achieving
state-of-the-art results. By bridging the representational strengths of LMs
with the fidelity demands of audio synthesis, our approach repositions LMs as
competitive contenders against diffusion models in T2A tasks. Moreover, by
aligning audio representations with linguistic structures, Siren facilitates a
promising pathway toward unified multi-modal generation frameworks.