The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
2510.09424v1
cs.CL, cs.AI, cs.LG, eess.AS
2025-10-14
Authors:
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
Abstract
This paper presents a comparative study of context management strategies for
end-to-end Spoken Dialogue State Tracking using Speech-LLMs. We systematically
evaluate traditional multimodal context (combining text history and spoken
current turn), full spoken history, and compressed spoken history approaches.
Our experiments on the SpokenWOZ corpus demonstrate that providing the full
spoken conversation as input yields the highest performance among models of
similar size, significantly surpassing prior methods. Furthermore, we show that
attention-pooling-based compression of the spoken history offers a strong
trade-off, maintaining competitive accuracy with reduced context size. Detailed
analysis confirms that improvements stem from more effective context
utilization.
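
To make the idea of attention-pooling-based compression concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a generic formulation in which a small set of query vectors attends over a long sequence of speech-frame embeddings and returns one pooled vector per query, so the context shrinks from T frames to k vectors. The function name `attention_pool`, the dimensions, and the random stand-in data are all hypothetical; in a trained system the queries would be learned parameters and the frames would come from a speech encoder.

```python
import math
import random


def attention_pool(frames, queries):
    """Compress T frame vectors into len(queries) pooled vectors.

    frames:  list of T vectors, each of length dim (hypothetical encoder output)
    queries: list of k vectors, each of length dim (learned in a real model)
    Returns a list of k vectors of length dim.
    """
    dim = len(frames[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [
            sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim)
            for f in frames
        ]
        # Numerically stable softmax over the time axis.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the frames under the attention weights.
        pooled.append(
            [sum(w * f[j] for w, f in zip(weights, frames)) for j in range(dim)]
        )
    return pooled


# Random stand-in data: 50 frames of dimension 8, compressed to 4 vectors.
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
pooled = attention_pool(frames, queries)
print(len(frames), "frames ->", len(pooled), "pooled vectors")
```

Under this assumed formulation, the number of queries controls the size of the compressed context independently of the length of the spoken history, which is the kind of accuracy versus context-size trade-off the abstract describes.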