Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert
2510.03896v1
cs.CV, cs.RO
2025-10-08
Authors:
Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen
Abstract
Although Vision-Language Models (VLMs) have demonstrated impressive planning
and reasoning capabilities, translating these abilities into the physical world
introduces significant challenges. Conventional Vision-Language-Action (VLA)
models, which integrate reasoning and action into a monolithic architecture,
generalize poorly because they are constrained by scarce, narrow-domain data.
While recent dual-system approaches attempt to decouple "thinking" from
"acting", they are often constrained by semantic ambiguities within the action
module. This ambiguity makes large-scale, cross-task training infeasible.
Consequently, these systems typically necessitate fine-tuning on newly
collected data when deployed to novel environments, and the cooperation
mechanism between the two systems remains ill-defined. To address these
limitations, we introduce, for the first time, a framework centered around a
generalizable action expert. Our approach utilizes sparse 3D trajectories as an
intermediate representation, effectively bridging the high-level planning
capabilities of the VLM with the low-level physical action module. During the
planning phase, the VLM is only required to generate coarse 3D waypoints. These
waypoints are then processed by our generalizable action expert, which refines
them into dense, executable action sequences by sampling real-time point cloud
observations of the environment. To promote training efficiency and robust
generalization, we introduce a novel "Action Pre-training, Pointcloud
Fine-tuning" paradigm. Our method combines the broad generalization
capabilities of VLMs in visual understanding and planning with the
fine-grained, action-level generalization of the action expert.
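
To make the waypoint-to-action interface concrete, the following is a minimal sketch of turning sparse 3D waypoints into a dense sequence of positions. It is not the authors' method: the action expert described in the abstract is a learned module that also uses real-time point cloud observations, whereas this sketch substitutes plain linear interpolation and ignores the scene entirely. The function name `densify_waypoints` and the parameter `steps_per_segment` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def densify_waypoints(waypoints: List[Point3D], steps_per_segment: int) -> List[Point3D]:
    """Linearly interpolate between consecutive sparse 3D waypoints.

    Assumption: this is a stand-in for a learned action expert. It does not
    look at point clouds and performs no collision or feasibility checks.
    """
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be >= 1")
    if len(waypoints) < 2:
        return list(waypoints)

    dense: List[Point3D] = []
    for start, end in zip(waypoints, waypoints[1:]):
        for i in range(steps_per_segment):
            t = i / steps_per_segment  # fraction of the way along this segment
            dense.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    dense.append(waypoints[-1])  # close the path on the final waypoint
    return dense


# Hypothetical coarse plan: lift 0.2 m, then move 0.4 m along x.
coarse = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.2), (0.4, 0.0, 0.2)]
dense = densify_waypoints(coarse, steps_per_segment=4)
```

In this sketch the planner and the executor share only a list of 3D points. The abstract describes the same kind of decoupling, with the VLM emitting coarse waypoints and a separate module producing the dense, executable sequence.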