X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
2511.04671v1
cs.RO, cs.AI, cs.CV
2025-11-08
Авторы:
Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia
Abstract
Human videos can be recorded quickly and at scale, making them an appealing
source of training data for robot learning. However, humans and robots differ
fundamentally in embodiment, resulting in mismatched action execution. Direct
kinematic retargeting of human hand motion can therefore produce actions that
are physically infeasible for robots. Despite these low-level differences,
human demonstrations provide valuable motion cues about how to manipulate and
interact with objects. Our key idea is to exploit the forward diffusion
process: as noise is added to actions, low-level execution differences fade
while high-level task guidance is preserved. We present X-Diffusion, a
principled framework for training diffusion policies that maximally leverages
human data without learning dynamically infeasible motions. X-Diffusion first
trains a classifier to predict whether a noisy action is executed by a human or
robot. Then, a human action is incorporated into policy training only after
adding sufficient noise such that the classifier cannot discern its embodiment.
Actions consistent with robot execution supervise fine-grained denoising at low
noise levels, while mismatched human actions provide only coarse guidance at
higher noise levels. Our experiments show that naive co-training under
execution mismatches degrades policy performance, while X-Diffusion
consistently improves it. Across five manipulation tasks, X-Diffusion achieves
a 16% higher average success rate than the best baseline. The project website
is available at https://portal-cornell.github.io/X-Diffusion/.
Ссылки и действия
Дополнительные ресурсы: