Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding
2509.24133v1
cs.CV, cs.CL
2025-10-01
Authors:
Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Abstract
Grounding natural language queries in graphical user interfaces (GUIs)
presents a challenging task that requires models to comprehend diverse UI
elements across various applications and systems, while also accurately
predicting the spatial coordinates for the intended operation. To tackle this
problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a
synergistic coarse-to-fine framework that effectively improves GUI grounding
performance. GMS leverages the complementary strengths of general
vision-language models (VLMs) and small, task-specific GUI grounding models by
assigning them distinct roles within the framework. Specifically, the general
VLM acts as a 'Scanner' to identify potential regions of interest, while the
fine-tuned grounding model serves as a 'Locator' that outputs precise
coordinates within these regions. This design is inspired by how humans perform
GUI grounding, where the eyes scan the interface and the brain focuses on
interpretation and localization. Our whole framework consists of five stages
and incorporates hierarchical search with cross-modal communication to achieve
promising prediction results. Experimental results on the ScreenSpot-Pro
dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0\%$
and $3.7\%$ accuracy respectively when used independently, their integration
within the GMS framework yields an overall accuracy of $35.7\%$, representing a
$10\times$ improvement. Additionally, GMS significantly outperforms other strong
baselines under various settings, demonstrating its robustness and potential
for general-purpose GUI grounding.
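
The division of labor described above (a Scanner that proposes regions, a Locator that predicts coordinates inside each region) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the five stages, the hierarchical search, or the cross-modal communication, so none of that is modeled here. The names `Region`, `Scanner`, `Locator`, `crop_to_screen`, and `ground` are hypothetical, and the sketch assumes the Locator returns a point normalized to [0, 1] within its crop.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Region:
    """A candidate screen region in full-screenshot pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


# Hypothetical interfaces (assumptions, not taken from the paper):
# a Scanner maps (query, screen size) to candidate regions, and a
# Locator maps (query, region) to a point normalized to [0, 1] in the crop.
Scanner = Callable[[str, Tuple[int, int]], List[Region]]
Locator = Callable[[str, Region], Point]


def crop_to_screen(point_in_crop: Point, region: Region) -> Point:
    """Map a crop-normalized point back to full-screenshot pixel coordinates."""
    x, y = point_in_crop
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("expected a point normalized to [0, 1] within the crop")
    return (region.left + x * region.width, region.top + y * region.height)


def ground(query: str, screen_size: Tuple[int, int],
           scanner: Scanner, locator: Locator) -> List[Point]:
    """Coarse step: scan for regions. Fine step: locate a point in each one."""
    candidates = []
    for region in scanner(query, screen_size):
        candidates.append(crop_to_screen(locator(query, region), region))
    return candidates
```

In a real system the two callables would wrap a general VLM and a fine-tuned grounding model respectively, and a further step would be needed to select one point among the candidates; how GMS does this is not stated in the abstract.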