TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
2510.07181v2
cs.RO, cs.AI, cs.CV
2025-10-10
Authors:
Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial
reasoning, yet they remain fundamentally limited to qualitative precision and
lack the computational precision required for real-world robotics. Current
approaches fail to leverage metric cues from depth sensors and camera
calibration, instead reducing geometric problems to pattern recognition tasks
that cannot deliver the centimeter-level accuracy essential for robotic
manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel
framework that transforms VLMs from perceptual estimators to geometric
computers by enabling them to generate and execute precise geometric
computations through external tools. Rather than attempting to internalize
complex geometric operations within neural networks, TIGeR empowers models to
recognize geometric reasoning requirements, synthesize appropriate
computational code, and invoke specialized libraries for exact calculations. To
support this paradigm, we introduce TIGeR-300K, a comprehensive
tool-invocation-oriented dataset covering point transformations, pose
estimation, and spatial compatibility verification, complete with tool
invocation sequences and intermediate computations. Through a two-stage
training pipeline combining supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves
state-of-the-art (SOTA) performance on geometric reasoning benchmarks while demonstrating
centimeter-level precision in real-world robotic manipulation tasks.
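
To make concrete the kind of exact computation the abstract refers to (using depth and camera calibration rather than visual estimation), the following is a minimal sketch of a point transformation. It is not TIGeR's actual tool interface: it assumes a standard pinhole camera model, and the function names, intrinsics, and camera-to-base transform are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: the function names and all numeric values below are
# hypothetical and are not taken from the TIGeR paper or its codebase.

def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) with metric depth -> 3D point
    in the camera frame, in meters."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return [x, y, depth_m]


def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists, row-major) to a 3D point p."""
    p_h = [p[0], p[1], p[2], 1.0]
    return [sum(T[r][c] * p_h[c] for c in range(4)) for r in range(3)]


# Assumed camera intrinsics (focal lengths and principal point, in pixels).
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

# Pixel (380, 240) observed at 0.5 m depth.
p_cam = backproject_pixel(380, 240, 0.5, fx, fy, cx, cy)

# Assumed camera-to-robot-base transform: identity rotation,
# translation of (0.1, 0.0, 0.2) m.
T_base_cam = [
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]
p_base = transform_point(T_base_cam, p_cam)

print("camera frame:", p_cam)
print("base frame:  ", p_base)
```

With these assumed values the point lies roughly 0.05 m to the right of the optical axis in the camera frame, and the translation shifts it to about (0.15, 0.0, 0.7) m in the base frame. The arithmetic is exact up to floating-point error, which is why delegating it to code can reach the centimeter-level accuracy that pattern recognition alone cannot.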