MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption
2510.05580v1
cs.AI, cs.RO
2025-10-09
Авторы:
Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides
Abstract
Vision-Language-Action (VLA) models show promise in embodied reasoning, yet
remain far from true generalists-they often require task-specific fine-tuning,
and generalize poorly to unseen tasks. We propose MetaVLA, a unified,
backbone-agnostic post-training framework for efficient and scalable alignment.
MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse
target tasks into a single fine-tuning stage while leveraging structurally
diverse auxiliary tasks to improve in-domain generalization. Unlike naive
multi-task SFT, MetaVLA integrates a lightweight meta-learning
mechanism-derived from Attentive Neural Processes-to enable rapid adaptation
from diverse contexts with minimal architectural change or inference overhead.
On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA
by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K,
and cuts GPU time by ~76%. These results show that scalable, low-resource
post-training is achievable-paving the way toward general-purpose embodied
agents. Code will be available.
Ссылки и действия
Дополнительные ресурсы: