Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

arXiv:2510.14300v1 [cs.RO, cs.AI], 2025-10-18

Authors:

Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu

Abstract

Vision-Language-Action (VLA) models are developing rapidly and show promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents two critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it is particularly valuable to fully leverage well-pretrained VLA model weights during scaling. (2) Real-time control requires carefully balancing model capacity against computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing its feedforward layers with sparsely activated MoE layers. AdaMoE decouples expert selection from expert weighting through an independent scale adapter that works alongside the traditional router. Experts are thus selected based on task relevance while contributing with independently controlled weights, which allows collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize: through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
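
The decoupling of expert selection from expert weighting can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state how the scale adapter's output is combined with the router's, so the additive rule below is an assumption, and every name here (decoupled_moe_layer, router_w, scale_w, make_linear_expert) is hypothetical. Each expert is a single linear map standing in for a feedforward layer.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decoupled_moe_layer(x, router_w, scale_w, experts, top_k=2):
    """Route one token through a sparsely activated MoE layer.

    Selection uses the router only; each selected expert's weight also
    gets an independent term from the scale adapter (assumed additive).
    """
    router_logits = matvec(router_w, x)  # one logit per expert
    scales = matvec(scale_w, x)          # independent scale adapter

    # Expert selection: top-k by router logit; scales play no part here.
    ranked = sorted(range(len(experts)),
                    key=lambda i: router_logits[i], reverse=True)
    selected = ranked[:top_k]

    # Expert weighting: router softmax over the selected experts plus
    # the adapter's scale (hypothetical combination rule).
    probs = softmax([router_logits[i] for i in selected])
    weights = [p + scales[i] for p, i in zip(probs, selected)]

    # Only the selected experts are evaluated (sparse activation).
    output = [0.0] * len(x)
    for weight, idx in zip(weights, selected):
        expert_out = experts[idx](x)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, selected, weights


def make_linear_expert(dim, rng):
    """Hypothetical stand-in for a feedforward expert: one linear map."""
    w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: matvec(w, x)


rng = random.Random(0)
dim, num_experts = 4, 4
experts = [make_linear_expert(dim, rng) for _ in range(num_experts)]
router_w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
scale_w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [0.5, -0.2, 0.1, 0.9]

output, selected, weights = decoupled_moe_layer(x, router_w, scale_w, experts)
```

In this sketch, changing scale_w alters how much each chosen expert contributes but never which experts are chosen, which is the property the abstract describes. With the adapter zeroed out, the layer reduces to conventional top-k routing with softmax weights.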

