ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context
2510.04246v1
cs.RO, cs.AI
2025-10-08
Авторы:
Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, Jinwoo Shin
Abstract
Leveraging temporal context is crucial for success in partially observable
robotic tasks. However, prior work in behavior cloning has demonstrated
inconsistent performance gains when using multi-frame observations. In this
paper, we introduce ContextVLA, a policy model that robustly improves robotic
task performance by effectively leveraging multi-frame observations. Our
approach is motivated by the key observation that Vision-Language-Action models
(VLA), i.e., policy models built upon a Vision-Language Model (VLM), more
effectively utilize multi-frame observations for action generation. This
suggests that VLMs' inherent temporal understanding capability enables them to
extract more meaningful context from multi-frame observations. However, the
high dimensionality of video inputs introduces significant computational
overhead, making VLA training and inference inefficient. To address this,
ContextVLA compresses past observations into a single context token, allowing
the policy to efficiently leverage temporal context for action generation. Our
experiments show that ContextVLA consistently improves over single-frame VLAs
and achieves the benefits of full multi-frame training but with reduced
training and inference times.
Ссылки и действия
Дополнительные ресурсы: