Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
2510.02066v1
cs.CL, cs.SD, eess.AS
2025-10-04
Авторы:
Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Abstract
Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity
detection (VAD) for turn-taking, but VAD fails to distinguish between pauses
and turn completions. Duplex SDS models address this by predicting output
continuously, including silence tokens, thus removing the need for explicit
VAD. However, they often have complex dual-channel architecture and lag behind
cascaded models in semantic reasoning. To overcome these challenges, we propose
SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating
between processing fixed-duration user input and generating responses in a
blockwise manner. Using frame-level alignments, we create intermediate
targets-aligned user transcripts and system responses for each block.
Experiments show that our approach produces more coherent and interpretable
responses than existing duplex methods while supporting lower-latency and
overlapping interactions compared to turn-by-turn systems.
Ссылки и действия
Дополнительные ресурсы: