Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
2510.16344v1
cs.RO, cs.AI
2025-10-22
Авторы:
Chenrui Tie, Shengxiang Sun, Yudi Lin, Yanbo Wang, Zhongrui Li, Zhouhan Zhong, Jinxuan Zhu, Yiman Pang, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao
Abstract
Assembly hinges on reliably forming connections between parts; yet most
robotic approaches plan assembly sequences and part poses while treating
connectors as an afterthought. Connections represent the critical "last mile"
of assembly execution, while task planning may sequence operations and motion
plan may position parts, the precise establishment of physical connections
ultimately determines assembly success or failure. In this paper, we consider
connections as first-class primitives in assembly representation, including
connector types, specifications, quantities, and placement locations. Drawing
inspiration from how humans learn assembly tasks through step-by-step
instruction manuals, we present Manual2Skill++, a vision-language framework
that automatically extracts structured connection information from assembly
manuals. We encode assembly tasks as hierarchical graphs where nodes represent
parts and sub-assemblies, and edges explicitly model connection relationships
between components. A large-scale vision-language model parses symbolic
diagrams and annotations in manuals to instantiate these graphs, leveraging the
rich connection knowledge embedded in human-designed instructions. We curate a
dataset containing over 20 assembly tasks with diverse connector types to
validate our representation extraction approach, and evaluate the complete task
understanding-to-execution pipeline across four complex assembly scenarios in
simulation, spanning furniture, toys, and manufacturing components with
real-world correspondence.
Ссылки и действия
Дополнительные ресурсы: