VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation
2510.05827v1
cs.RO, cs.AI
2025-10-09
Авторы:
Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Yuedi Zhang, Qi Zhang, Pengxiang Ding, Cheng Chi, Donglin Wang, Badong Chen
Abstract
Robotic grasping is one of the most fundamental tasks in robotic
manipulation, and grasp detection/generation has long been the subject of
extensive research. Recently, language-driven grasp generation has emerged as a
promising direction due to its practical interaction capabilities. However,
most existing approaches either lack sufficient reasoning and generalization
capabilities or depend on complex modular pipelines. Moreover, current grasp
foundation models tend to overemphasize dialog and object semantics, resulting
in inferior performance and restriction to single-object grasping. To maintain
strong reasoning ability and generalization in cluttered environments, we
propose VCoT-Grasp, an end-to-end grasp foundation model that incorporates
visual chain-of-thought reasoning to enhance visual understanding for grasp
generation. VCoT-Grasp adopts a multi-turn processing paradigm that dynamically
focuses on visual inputs while providing interpretable reasoning traces. For
training, we refine and introduce a large-scale dataset, VCoT-GraspSet,
comprising 167K synthetic images with over 1.36M grasps, as well as 400+
real-world images with more than 1.2K grasps, annotated with intermediate
bounding boxes. Extensive experiments on both VCoT-GraspSet and real robot
demonstrate that our method significantly improves grasp success rates and
generalizes effectively to unseen objects, backgrounds, and distractors. More
details can be found at https://zhanghr2001.github.io/VCoT-Grasp.github.io.
Ссылки и действия
Дополнительные ресурсы: