CapGeo: A Caption-Assisted Approach to Geometric Reasoning
2510.09302v1
cs.CV, cs.AI, cs.CL
2025-10-14
Authors:
Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang
Abstract
Geometric reasoning remains a core challenge for Multimodal Large Language
Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3
and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite
exhibiting strong textual reasoning abilities on tasks like the International
Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in
understanding geometric diagrams rather than reasoning itself. Since geometric
figures can often be faithfully described in concise textual form, converting
visual content into captions offers a promising direction. Motivated by this
insight, we introduce CapGeo, a caption-assisted reasoning framework that
bridges visual and textual modalities. Experiments show substantial
improvements when models are equipped with captions: Qwen2.5-VL-72B improves
from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to
73.0%. To systematically evaluate and identify high-quality geometric
captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated
figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based
evaluation metric that correlates strongly with downstream CapGeo performance,
enabling reliable assessment of geometric captioning ability. Together, our
framework and benchmark highlight a new pathway toward advancing geometric
reasoning in MLLMs.
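
As a rough illustration of the caption-then-reason idea described in the abstract, the Python sketch below first turns a figure into a text caption and then passes that caption, together with the problem statement, to a reasoning model. It is only a sketch under stated assumptions: the function names (`build_prompt`, `caption_assisted_answer`), the prompt wording, and the stub captioner and reasoner are hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
from typing import Callable


def build_prompt(question: str, caption: str) -> str:
    """Combine a textual figure description with the problem statement."""
    # The prompt layout is an assumption made for illustration only.
    return (
        "Figure description:\n" + caption.strip() + "\n\n"
        "Problem:\n" + question.strip() + "\n\n"
        "Solve the problem step by step."
    )


def caption_assisted_answer(
    image: bytes,
    question: str,
    captioner: Callable[[bytes], str],
    reasoner: Callable[[str], str],
) -> str:
    """Caption the figure first, then reason over text only."""
    caption = captioner(image)  # visual content -> text
    return reasoner(build_prompt(question, caption))  # text -> answer


# Hypothetical stand-ins for real models, so the sketch runs offline.
def stub_captioner(image: bytes) -> str:
    return "Triangle ABC with AB = AC and angle BAC = 40 degrees."


def stub_reasoner(prompt: str) -> str:
    # Isosceles triangle: base angles are (180 - 40) / 2 = 70 degrees.
    return "70" if "AB = AC" in prompt else "unknown"


answer = caption_assisted_answer(
    b"", "Find angle ABC in degrees.", stub_captioner, stub_reasoner
)
```

In practice the two callables would wrap a captioning model and a reasoning model. Keeping them as plain function arguments makes it possible to swap captioners and compare their effect on downstream accuracy.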