TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
2511.05275v1
cs.RO, cs.LG
2025-11-11
Авторы:
Hokyun Im, Euijin Jeong, Jianlong Fu, Andrey Kolobov, Youngwoon Lee
Abstract
Vision-language-action models (VLAs) trained on large-scale robotic datasets
have demonstrated strong performance on manipulation tasks, including bimanual
tasks. However, because most public datasets focus on single-arm
demonstrations, adapting VLAs for bimanual tasks typically requires substantial
additional bimanual data and fine-tuning. To address this challenge, we
introduce TwinVLA, a modular framework that composes two copies of a pretrained
single-arm VLA into a coordinated bimanual VLA. Unlike monolithic
cross-embodiment models trained on mixtures of single-arm and bimanual data,
TwinVLA improves both data efficiency and performance by composing pretrained
single-arm policies. Across diverse bimanual tasks in real-world and simulation
settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model
without requiring any bimanual pretraining. Furthermore, it narrows the gap to
state-of-the-art model, $\pi_0$ which rely on extensive proprietary bimanual
data and compute cost. These results establish our modular composition approach
as a data-efficient and scalable path toward high-performance bimanual
manipulation, leveraging public single-arm data.
Ссылки и действия
Дополнительные ресурсы: