DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
2510.08475v1
cs.RO, cs.CV, cs.LG
2025-10-11
Авторы:
Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, Tsung-Wei Ke
Abstract
We present DexMan, an automated framework that converts human visual
demonstrations into bimanual dexterous manipulation skills for humanoid robots
in simulation. Operating directly on third-person videos of humans manipulating
rigid objects, DexMan eliminates the need for camera calibration, depth
sensors, scanned 3D object assets, or ground-truth hand and object motion
annotations. Unlike prior approaches that consider only simplified floating
hands, it directly controls a humanoid robot and leverages novel contact-based
rewards to improve policy learning from noisy hand-object poses estimated from
in-the-wild videos.
DexMan achieves state-of-the-art performance in object pose estimation on the
TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD.
Meanwhile, its reinforcement learning policy surpasses previous methods by 19%
in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both
real and synthetic videos, without the need for manual data collection and
costly motion capture, and enabling the creation of large-scale, diverse
datasets for training generalist dexterous manipulation.
Ссылки и действия
Дополнительные ресурсы: