Latent Speech-Text Transformer
2510.06195v1
cs.CL, cs.AI, cs.LG, eess.AS
2025-10-09
Авторы:
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
Abstract
Auto-regressive speech-text models are typically pre-trained on a large
number of interleaved sequences of text tokens and raw speech encoded as speech
tokens using vector quantization. These models have demonstrated
state-of-the-art performance in speech-to-speech understanding and generation
benchmarks, together with promising scaling laws, primarily enabled by the
representational alignment between text and speech. Nevertheless, they suffer
from shortcomings, partly owing to the disproportionately longer sequences of
speech tokens in contrast to textual tokens. This results in a large compute
imbalance between modalities during pre-training as well as during inference,
and a potential hindrance to effectively aligning speech and text, ultimately
translating to several orders of magnitude slower scaling laws. We introduce
the Latent Speech-Text Transformer (LST), which makes pre-training speech-text
models more data-efficient by dynamically and inexpensively aggregating speech
tokens into latent speech patches. These patches serve as higher-level units
that can either align with corresponding textual units to aid capability
transfer or even encapsulate common speech sequences like silences to be more
compute-efficient. We show that LST outperforms vanilla approaches on
speech-to-speech as well as text-to-text benchmarks in both data- and
compute-controlled settings, the former indicating more effective
representational alignment and the latter indicating steeper scaling laws for
speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute
gain in speech accuracy under compute-controlled training and 5.3% under
data-controlled training, while also improving text performance. We will
release our models, code, and the evaluation data to facilitate further
research.