LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
2509.25528v1
cs.CV, cs.AI, cs.RO
2025-10-02
Authors:
Pranav Saxena, Avigyan Bhattacharya, Ji Zhang, Wenshan Wang
Abstract
Referential grounding in outdoor driving scenes is challenging due to large
scene variability, many visually similar objects, and dynamic elements that
complicate resolving natural-language references (e.g., "the black car on the
right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf
vision-language models for fine-grained attribute extraction with large
language models for symbolic reasoning. LLM-RG processes an image and a
free-form referring expression by using an LLM to extract relevant object types
and attributes, detecting candidate regions, generating rich visual descriptors
with a VLM, and then combining these descriptors with spatial metadata into
natural-language prompts that are input to an LLM for chain-of-thought
reasoning to identify the referent's bounding box. Evaluated on the Talk2Car
benchmark, LLM-RG yields substantial gains over both LLM-based and VLM-based
baselines. Additionally, our ablations show that adding 3D spatial cues further
improves grounding. Our results demonstrate the complementary strengths of VLMs
and LLMs, applied in a zero-shot manner, for robust outdoor referential
grounding.
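
The prompt-construction step described in the abstract (combining per-candidate visual descriptors with spatial metadata into a natural-language prompt for an LLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the `horizontal_position` helper with its thirds-based left/center/right split, the prompt wording, and the example boxes and descriptors are hypothetical and not taken from the paper. The detector, VLM, and LLM calls are omitted; their outputs are stood in for by hand-written values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One detected region. All fields are illustrative, not from the paper."""
    index: int
    label: str                               # object type, e.g. from a detector
    box: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    descriptor: str                          # free-text attributes, e.g. from a VLM


def horizontal_position(box, image_width):
    """Coarse left/center/right cue from the box center (assumed thirds split)."""
    center_x = (box[0] + box[2]) / 2.0
    if center_x < image_width / 3.0:
        return "left"
    if center_x < 2.0 * image_width / 3.0:
        return "center"
    return "right"


def build_prompt(expression, candidates, image_width):
    """Assemble a text prompt listing each candidate with its descriptor and position."""
    lines = [
        f'Referring expression: "{expression}"',
        "Candidate objects:",
    ]
    for c in candidates:
        position = horizontal_position(c.box, image_width)
        lines.append(
            f"[{c.index}] {c.label}: {c.descriptor}; "
            f"horizontal position: {position}; box={c.box}"
        )
    lines.append(
        "Think step by step, then answer with the index of the referred object."
    )
    return "\n".join(lines)


# Hand-written stand-ins for detector and VLM outputs (hypothetical values).
candidates = [
    Candidate(0, "car", (100, 400, 300, 550), "white sedan, parked"),
    Candidate(1, "car", (1200, 420, 1500, 600), "black hatchback, moving"),
]
prompt = build_prompt("the black car on the right", candidates, image_width=1600)
print(prompt)
```

In a full pipeline, the resulting string would be sent to an LLM and the returned index mapped back to the corresponding bounding box. The sketch uses only 2D image coordinates; the 3D spatial cues mentioned in the ablations are not modeled here.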