dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
2509.25681v1
cs.RO, cs.CV
2025-10-02
Authors:
Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, Yi Xu
Abstract
Vision-Language-Action (VLA) models are emerging as a next-generation
paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages
a multimodal chain-of-thought to unify visual perception, language reasoning,
and robotic control in a single system. dVLA jointly optimizes perception,
language understanding, and action under a single diffusion objective, enabling
stronger cross-modal reasoning and better generalization to novel instructions
and objects. For practical deployment, we mitigate inference latency by
incorporating two acceleration strategies, a prefix attention mask and KV
caching, which together speed up test-time inference. We
evaluate dVLA in both simulation and the real world: on the LIBERO benchmark,
it achieves state-of-the-art performance with a 96.4% average success rate,
consistently surpassing both discrete and continuous action policies; on a real
Franka robot, it succeeds across a diverse task suite, including a challenging
bin-picking task that requires multi-step planning, demonstrating robust
real-world performance. Together, these results underscore the promise of
unified diffusion frameworks for practical, high-performance VLA robotics.
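
The abstract names a prefix attention mask as one of the two acceleration strategies but does not describe its layout. The sketch below shows one common form of such a mask, under the assumption that the observation and instruction tokens form a prefix that attends only to itself, while the remaining tokens attend to the whole sequence. The function name `prefix_attention_mask` and the parameters `prefix_len` and `total_len` are hypothetical and are not taken from the paper.

```python
def prefix_attention_mask(prefix_len, total_len):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout (illustrative, not confirmed by the paper):
    tokens [0, prefix_len) are the prefix, tokens [prefix_len, total_len)
    are the suffix that is refined during iterative decoding.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")

    mask = []
    for i in range(total_len):
        if i < prefix_len:
            # Prefix queries see only prefix keys, so their representations
            # never depend on the changing suffix tokens.
            row = [j < prefix_len for j in range(total_len)]
        else:
            # Suffix queries see every token: the full prefix and the suffix.
            row = [True] * total_len
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for a 3-token prefix and a 2-token suffix.
    for row in prefix_attention_mask(prefix_len=3, total_len=5):
        print("".join("1" if allowed else "0" for allowed in row))
```

With a mask of this shape, the keys and values computed for the prefix are identical at every decoding step, which is what would allow them to be computed once and reused from a KV cache instead of being recomputed.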