EmbodiSwap for Zero-Shot Robot Imitation Learning

2510.03706v1 cs.RO, cs.AI, cs.CV, cs.LG 2025-10-08

Authors:

Eadom Dessalene, Pavan Mantripragada, Michael Maynord, Yiannis Aloimonos

Abstract

We introduce EmbodiSwap, a method for producing photorealistic synthetic robot overlays over human video. We employ EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild egocentric human video and a target robot embodiment. We train a closed-loop robot manipulation policy over the data produced by EmbodiSwap. We make novel use of V-JEPA as a visual backbone, repurposing V-JEPA from the domain of video understanding to imitation learning over synthetic robot videos. Adoption of V-JEPA outperforms alternative vision backbones more conventionally used within robotics. In real-world tests, our zero-shot trained V-JEPA model achieves an $82\%$ success rate, outperforming a few-shot trained $\pi_0$ network as well as $\pi_0$ trained over data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays, which takes human videos and an arbitrary robot URDF as input and generates a robot dataset, (ii) the robot dataset we synthesize over EPIC-Kitchens, HOI4D, and Ego4D, and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.


