Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
2510.18502v1
cs.CV, cs.AI, cs.CL, cs.LG
2025-10-23
Авторы:
Wei-Chia Chang, Yan-Ann Chen
Abstract
Vehicle make and model recognition (VMMR) is an important task in intelligent
transportation systems, but existing approaches struggle to adapt to newly
released models. Contrastive Language-Image Pretraining (CLIP) provides strong
visual-text alignment, yet its fixed pretrained weights limit performance
without costly image-specific finetuning. We propose a pipeline that integrates
vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to
support zero-shot recognition through text-based reasoning. A VLM converts
vehicle images into descriptive attributes, which are compared against a
database of textual features. Relevant entries are retrieved and combined with
the description to form a prompt, and a language model (LM) infers the make and
model. This design avoids large-scale retraining and enables rapid updates by
adding textual descriptions of new vehicles. Experiments show that the proposed
method improves recognition by nearly 20% over the CLIP baseline, demonstrating
the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city
applications.