Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
2510.24055v1
cs.RO, cs.LG
2025-10-30
Авторы:
Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai
Abstract
Perceptual ambiguity and task conflict limit multitask robotic manipulation
via imitation learning. We propose a framework combining a Language-Conditioned
Visual Representation (LCVR) module and a Language-conditioned
Mixture-ofExperts Density Policy (LMoE-DP). LCVR resolves perceptual
ambiguities by grounding visual features with language instructions, enabling
differentiation between visually similar tasks. To mitigate task conflict,
LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal
action distributions, stabilized by gradient modulation. On real-robot
benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion
Policy (DP) success rates by 33.75% and 25%, respectively. The full framework
achieves a 79% average success, outperforming the advanced baseline by 21%. Our
work shows that combining semantic grounding and expert specialization enables
robust, efficient multi-task manipulation
Ссылки и действия
Дополнительные ресурсы: