Talking Points: Describing and Localizing Pixels
2510.14583v1
cs.CV, cs.CL
2025-10-18
Авторы:
Matan Rusanovsky, Shimon Malnick, Shai Avidan
Abstract
Vision-language models have achieved remarkable success in cross-modal
understanding. Yet, these models remain limited to object-level or region-level
grounding, lacking the capability for pixel-precise keypoint comprehension
through natural language. We introduce a novel framework for pixel level
grounding. The framework consists of two complementary components: a Point
Descriptor that generates rich, contextual descriptions of individual
keypoints, and a Point Localizer that regresses precise pixel coordinates from
these descriptions. Unlike prior work that relies on templated prompts or
keypoint names, our approach produces free-form, coarse-to-fine descriptions
that situate keypoints within their visual context. Since there is no available
dataset to train such a system, we introduce LlamaPointInPart, a carefully
curated dataset of 20K+ image-keypoint-description triplets synthesized from
multiple vision-language models, capturing multi-scale information from
scene-level context to visual features around the keypoint. For cross-category
generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the
frozen Point Localizer as a reward model to produce descriptions that maximize
localization accuracy. To evaluate our results we establish a new evaluation
protocol. Instead of comparing the text description produced by our method to
the ground truth, we use the localizer to determine how close is the predicted
point generated to the ground truth point. Experiments demonstrate superior
performance compared to baseline models on LlamaPointInPart.The bidirectional
nature of our framework should enable future applications in both
keypoint-guided image understanding and language-guided precise localization.
Our code and dataset are publicly available at
https://github.com/matanr/Talking_Points.
Ссылки и действия
Дополнительные ресурсы: