VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
2509.25033v1
cs.CV, cs.LG, I.4.9
2025-10-01
Authors:
Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Abstract
Few-shot learning (FSL) aims to recognize novel concepts from only a few
labeled support samples. Recent studies enhance support features by
incorporating additional semantic information or designing complex semantic
fusion modules. However, these methods can still hallucinate semantics that
contradict the visual evidence because they are not grounded in actual
instances, which results in noisy guidance and costly corrections. To address these
issues, we propose a novel framework, bridging Vision and Text with LLMs for
Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts
conditioned on Large Language Models (LLMs) and support images, seamlessly
integrating them through a geometry-aware alignment. It mainly consists of
Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment
(CGA). Specifically, the CIP conditions an LLM on both class names and support
images to generate precise class descriptions iteratively in a single
structured reasoning pass. These descriptions not only enrich the semantic
understanding of novel classes but also enable the zero-shot synthesis of
semantically consistent images. The descriptions and synthetic images act
respectively as complementary textual and visual prompts, providing high-level
class semantics and low-level intra-class diversity to compensate for limited
support data. Furthermore, the CGA jointly aligns the fused textual, support,
and synthetic visual representations by minimizing the kernelized volume of the
3-dimensional parallelotope they span. It captures global and nonlinear
relationships among all representations, enabling structured and consistent
multimodal integration. The proposed VT-FSL method establishes new
state-of-the-art performance across ten diverse benchmarks, including standard,
cross-domain, and fine-grained few-shot learning scenarios. Code is available
at https://github.com/peacelwh/VT-FSL.
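
The volume-based objective described for CGA can be illustrated with a small sketch. The volume of the parallelotope spanned by three vectors equals the square root of the determinant of their Gram matrix, and replacing inner products with kernel values gives a kernelized volume. The abstract does not specify the kernel, the feature normalization, or the exact loss, so the Gaussian kernel, the `gamma` value, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel. Kernel choice and gamma are assumptions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def gram_matrix(vectors, kernel):
    """Pairwise kernel values K[i][j] = kernel(vectors[i], vectors[j])."""
    return [[kernel(u, v) for v in vectors] for u in vectors]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def kernelized_volume(text_feat, support_feat, synth_feat, kernel=rbf_kernel):
    """Volume of the parallelotope spanned by the three features in kernel space.

    Computed as sqrt(det(K)) for the 3x3 kernel Gram matrix K. The determinant
    is clamped at zero to guard against small negative rounding errors.
    """
    gram = gram_matrix([text_feat, support_feat, synth_feat], kernel)
    return math.sqrt(max(det3(gram), 0.0))


# Hypothetical toy features: nearly identical vs. widely separated.
aligned = kernelized_volume([1.0, 0.0], [1.0, 0.05], [0.95, 0.0])
spread = kernelized_volume([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])
print(aligned < spread)  # prints True
```

In this toy setting, features that nearly coincide span a volume close to zero, while well-separated features span a larger one, so minimizing the volume pulls the three representations toward each other jointly rather than pair by pair.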