From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
2510.14952v2
cs.RO, cs.CV
2025-10-20
Authors:
Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Yibo Peng, Tao Huang, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Chang Xu
Abstract
Natural language offers an intuitive interface for humanoid robots, but existing
language-guided humanoid locomotion pipelines remain cumbersome and
unreliable. They typically decode human motion, retarget it to robot
morphology, and then track it with a physics-based controller. However, this
multi-stage process is prone to cumulative errors, introduces high latency, and
yields weak coupling between semantics and control. These limitations call for
a more direct pathway from language to action, one that eliminates fragile
intermediate stages. Therefore, we present RoboGhost, a retargeting-free
framework that directly conditions humanoid policies on language-grounded
motion latents. By bypassing explicit motion decoding and retargeting,
RoboGhost enables a diffusion-based policy to denoise executable actions
directly from noise, preserving semantic intent and supporting fast, reactive
control. A hybrid causal transformer-diffusion motion generator further ensures
long-horizon consistency while maintaining stability and diversity, yielding
rich latent representations for precise humanoid behavior. Extensive
experiments demonstrate that RoboGhost substantially reduces deployment
latency, improves success rates and tracking precision, and produces smooth,
semantically aligned locomotion on real humanoids. Beyond text, the framework
naturally extends to other modalities such as images, audio, and music,
providing a universal foundation for vision-language-action humanoid systems.