End-to-end Listen, Look, Speak and Act
2510.16756v1
cs.AI, cs.CL, cs.CV, cs.RO, eess.AS
2025-10-22
Авторы:
Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Chao Zhang
Abstract
Human interaction is inherently multimodal and full-duplex: we listen while
watching, speak while acting, and fluidly adapt to turn-taking and
interruptions. Realizing these capabilities is essential for building models
simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act),
which, to our knowledge, is the first full-duplex, end-to-end model that
simultaneously perceives and generates across vision, text, speech, and action
within a single architecture, enabling interaction patterns previously out of
reach, yielding more natural, human-like behaviors. At its core is a novel
SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each
modality to specialized experts and fuses them through a unified attention
backbone. This provides a generalizable solution for joint multimodal
perception and concurrent generation, leveraging strong pre-trained components
while enabling efficient modality integration and mitigating modality
interference. On speech-interaction and robot-manipulation benchmarks, ELLSA
matches modality-specific baselines, while uniquely supporting advanced
multimodal and full-duplex behaviors such as dialogue and action turn-taking,
defective instruction rejection, speaking-while-acting, context-grounded visual
question answering, and action barge-ins. We contend that ELLSA represents a
step toward more natural and general interactive intelligence, contributing to
the broader pursuit of artificial general intelligence. All data, code and
model checkpoints will be released upon acceptance.