HMVLM: Human Motion-Vision-Lanuage Model via MoE LoRA
2511.01463v1
cs.CV, cs.AI, cs.GR, 68T45, I.2.10; I.3.7
2025-11-06
Авторы:
Lei Hu, Yongjing Ye, Shihong Xia
Abstract
The expansion of instruction-tuning data has enabled foundation language
models to exhibit improved instruction adherence and superior performance
across diverse downstream tasks. Semantically-rich 3D human motion is being
progressively integrated with these foundation models to enhance multimodal
understanding and cross-modal generation capabilities. However, the modality
gap between human motion and text raises unresolved concerns about catastrophic
forgetting during this integration. In addition, developing
autoregressive-compatible pose representations that preserve generalizability
across heterogeneous downstream tasks remains a critical technical barrier. To
address these issues, we propose the Human Motion-Vision-Language Model
(HMVLM), a unified framework based on the Mixture of Expert Low-Rank
Adaption(MoE LoRA) strategy. The framework leverages the gating network to
dynamically allocate LoRA expert weights based on the input prompt, enabling
synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting
during instruction-tuning, we introduce a novel zero expert that preserves the
pre-trained parameters for general linguistic tasks. For pose representation,
we implement body-part-specific tokenization by partitioning the human body
into different joint groups, enhancing the spatial resolution of the
representation. Experiments show that our method effectively alleviates
knowledge forgetting during instruction-tuning and achieves remarkable
performance across diverse human motion downstream tasks.