Can Speech LLMs Think while Listening?
2510.07497v1
cs.CL, cs.AI, eess.AS
2025-10-11
Авторы:
Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
Abstract
Recent advances in speech large language models (speech LLMs) have enabled
seamless spoken interactions, but these systems still struggle with complex
reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning
has been to shown to significantly improve the reasoning abilities of
text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for
multi-stream speech LLMs, demonstrating that reasoning in text space improves
the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken
reasoning tasks. Beyond accuracy, the latency of the spoken response is a
crucial factor for interacting with voice-based agents. Inspired by the human
behavior of "thinking while listening," we propose methods to reduce the
additional latency from reasoning by allowing the model to start reasoning
before the user query has ended. To achieve this, we introduce an entropy-based
metric, "question completeness," which acts as an indicator to guide the model
on the optimal time to start reasoning. This method provides greater control
over the accuracy-latency trade-off compared with heuristic-based approaches
and, under equivalent latency conditions, yields a 4% accuracy gain on
ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference
data created using rejection sampling to push the accuracy-latency pareto
frontier further, resulting in a 70% reduction in latency without loss in
accuracy.