MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning
2510.03142v1
cs.RO, cs.CV
2025-10-07
Authors:
Tianyu Xu, Jiawei Chen, Jiazhao Zhang, Wenyao Zhang, Zekun Qi, Minghan Li, Zhizheng Zhang, He Wang
Abstract
Visual navigation policies are widely regarded as a promising direction, as they
mimic humans by using egocentric visual observations for navigation. However,
the optical information in visual observations is difficult to model explicitly,
unlike LiDAR point clouds or depth maps, and therefore requires
intelligent models and large-scale data. To this end, we propose to leverage
the intelligence of the Vision-Language-Action (VLA) model to learn diverse
navigation capabilities from synthetic expert data in a teacher-student manner.
Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360-degree
observations) based on pretrained large language models and visual foundation
models. For large-scale navigation data, we collect expert data from three
reinforcement learning (RL) experts trained with privileged depth information
in three challenging tailor-made environments for different navigation
capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA
model using data collected online from RL experts, where the training ratio is
dynamically balanced based on performance on individual capabilities. Through
extensive experiments in synthetic environments, we demonstrate that our model
achieves strong generalization capability. Moreover, we find that our student
VLA model outperforms the RL teachers, demonstrating the synergistic effect of
integrating multiple capabilities. Extensive real-world experiments further
confirm the effectiveness of our method.
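
The abstract does not say how the training ratio is balanced, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes a hypothetical per-capability success rate for the student model and gives each expert's data a sampling ratio proportional to the remaining performance gap; the function name `balance_ratios`, the `floor` parameter, and the example success rates are all illustrative assumptions.

```python
def balance_ratios(success_rates, floor=0.1):
    """Return a sampling ratio per capability that favors weaker capabilities.

    success_rates: dict mapping capability name to the student's success
        rate in [0, 1] (hypothetical evaluation metric).
    floor: small constant added to every gap so that no capability is
        dropped from training entirely (an assumption of this sketch).
    """
    # Gap to perfect performance, plus the floor, for each capability.
    gaps = {
        name: max(1.0 - rate, 0.0) + floor
        for name, rate in success_rates.items()
    }
    total = sum(gaps.values())
    # Normalize so the ratios sum to 1.
    return {name: gap / total for name, gap in gaps.items()}


# Illustrative values only; these are not results from the paper.
ratios = balance_ratios({"reaching": 0.9, "squeezing": 0.5, "avoiding": 0.7})
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under this scheme, the capability with the lowest success rate receives the largest share of the next round of expert data, and the ratios would be recomputed after each training iteration.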