Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
2510.08713v1
cs.AI, cs.CV, cs.RO
2025-10-14
Авторы:
Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G Hauptmann
Abstract
Enabling embodied agents to effectively imagine future states is critical for
robust and generalizable visual navigation. Current state-of-the-art
approaches, however, adopt modular architectures that separate navigation
planning from visual world modeling, leading to state-action misalignment and
limited adaptability in novel or dynamic scenarios. To overcome this
fundamental limitation, we propose UniWM, a unified, memory-augmented world
model integrating egocentric visual foresight and planning within a single
multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly
grounds action decisions in visually imagined outcomes, ensuring tight
alignment between prediction and control. A hierarchical memory mechanism
further integrates detailed short-term perceptual cues with longer-term
trajectory context, enabling stable, coherent reasoning over extended horizons.
Extensive experiments across four challenging benchmarks (Go Stanford, ReCon,
SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success
rates by up to 30%, significantly reduces trajectory errors compared to strong
baselines, and exhibits impressive zero-shot generalization on the unseen
TartanDrive dataset. These results highlight UniWM as a principled step toward
unified, imagination-driven embodied navigation.
Ссылки и действия
Дополнительные ресурсы: