Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
2510.21571v1
cs.RO, cs.AI, cs.CV, cs.LG
2025-10-28
Authors:
Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, Baining Guo
Abstract
This paper presents a novel approach for pretraining robotic manipulation
Vision-Language-Action (VLA) models using a large corpus of unscripted
real-life video recordings of human hand activities. Treating the human hand as a
dexterous robot end-effector, we show that "in-the-wild" egocentric human
videos without any annotations can be transformed into data formats fully
aligned with existing robotic V-L-A training data in terms of task granularity
and labels. We achieve this with a fully automated, holistic human activity
analysis approach for arbitrary human hand videos. This approach generates
atomic-level hand activity segments and their language descriptions, each
accompanied by framewise 3D hand motion and camera motion.
We process a large volume of egocentric videos and create a hand-VLA training
dataset containing 1M episodes and 26M frames. This training data covers a wide
range of objects and concepts, dexterous manipulation tasks, and environment
variations in real life, vastly exceeding the coverage of existing robot data.
We design a dexterous hand VLA model architecture and pretrain the model on
this dataset. The model exhibits strong zero-shot capabilities on completely
unseen real-world observations. Additionally, fine-tuning it on a small amount
of real robot action data significantly improves task success rates and
generalization to novel objects in real robotic experiments. We also
demonstrate the appealing scaling behavior of the model's task performance with
respect to pretraining data scale. We believe this work lays a solid foundation
for scalable VLA pretraining, advancing robots toward truly generalizable
embodied intelligence.
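
As a rough illustration of the episode format described above (a language description paired with framewise 3D hand motion and camera motion), here is a minimal Python sketch. All class names, field names, and vector layouts are hypothetical and chosen only for illustration; the abstract does not specify the actual data schema. The helper at the end simply divides the reported dataset totals (26M frames, 1M episodes) to get an average episode length.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandFrame:
    """One video frame of an episode (hypothetical layout)."""
    hand_motion: List[float]    # framewise 3D hand motion parameters (assumed flat vector)
    camera_motion: List[float]  # framewise camera motion parameters (assumed flat vector)


@dataclass
class HandVLAEpisode:
    """One atomic-level hand activity segment (hypothetical layout)."""
    description: str                              # language description of the segment
    frames: List[HandFrame] = field(default_factory=list)

    def num_frames(self) -> int:
        return len(self.frames)


def average_frames_per_episode(total_frames: int, total_episodes: int) -> float:
    """Average episode length implied by dataset-level totals."""
    if total_episodes <= 0:
        raise ValueError("total_episodes must be positive")
    return total_frames / total_episodes


# Toy episode with placeholder values, only to show the structure.
episode = HandVLAEpisode(
    description="pick up the cup",
    frames=[HandFrame(hand_motion=[0.0, 0.0, 0.0], camera_motion=[0.0, 0.0, 0.0])
            for _ in range(3)],
)

# Totals reported in the abstract: 26M frames across 1M episodes.
avg_len = average_frames_per_episode(26_000_000, 1_000_000)
print(episode.num_frames(), avg_len)
```

Under these totals, an episode averages about 26 frames, which fits the abstract's description of the segments as atomic-level.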