NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation
2510.03895v1
cs.RO, cs.CV
2025-10-08
Авторы:
Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen
Abstract
Vision-Language-Action (VLA) models represent a pivotal advance in embodied
intelligence, yet they confront critical barriers to real-world deployment,
most notably catastrophic forgetting. This issue stems from their overreliance
on continuous action sequences or action chunks, which inadvertently create
isolated data silos that disrupt knowledge retention across tasks. To tackle
these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA)
framework: a novel approach that narrows its focus to sparse trajectories,
thereby avoiding the catastrophic forgetting associated with dense trajectory
fine-tuning. A key innovation of NoTVLA lies in its trajectory planning
strategy: instead of centering on the target object's trajectory, it leverages
temporal compression and spatial reasoning pruning specifically for the robot
end effector's trajectory. Furthermore, training is conducted using these
sparse trajectories rather than dense action trajectories, an optimization that
delivers remarkable practical advantages with better performance in zero-shot.
In multi-task evaluation scenarios, NoTVLA achieves superior performance and
generalization compared to pi0 while operating under two critical constraints:
it uses over an order of magnitude less computing power than pi0 and requires
no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy
closely approximates that of single-task expert models. Crucially, it also
preserves the model's inherent language capabilities, enabling zero-shot
generalization in specific scenarios, supporting unified model deployment
across multiple robot platforms, and fostering a degree of generalization even
when perceiving tasks from novel perspectives.
Ссылки и действия
Дополнительные ресурсы: