Optimizing Speech Language Models for Acoustic Consistency

2509.26276v1 cs.CL, cs.SD 2025-10-02

Авторы:

Morteza Rohanian, Michael Krauthammer

Abstract

We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic--acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.

Ссылки и действия

Читать на arXiv Скачать PDF

Дополнительные ресурсы:

Optimizing Speech Language Models for Acoustic Consistency

Авторы:

Abstract

Ссылки и действия

Связанные статьи

Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

A new kid on the block: Distributional semantics predicts the word-specific tone...

CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokk...

POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Tex...

CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Навигация