SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval
2509.26012v1
cs.CV, I.4.9
2025-10-02
Authors:
Yuqi Xiao, Yingying Zhu
Abstract
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image
given a reference image and a relative text, without relying on costly triplet
annotations. Existing CLIP-based methods face two core challenges: (1)
union-based feature fusion indiscriminately aggregates all visual cues,
carrying over irrelevant background details that dilute the intended
modification, and (2) global cosine similarity from CLIP embeddings lacks the
ability to resolve fine-grained semantic relations. To address these issues, we
propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval
stage, SETR introduces an intersection-driven strategy that retains only the
overlapping semantics between the reference image and relative text, thereby
filtering out distractors inherent to union-based fusion and producing a
cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we
adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary
semantic relevance judgments ("Yes/No"), which goes beyond CLIP's global
feature matching by explicitly verifying relational and attribute-level
consistency. Together, these two stages form a complementary pipeline: coarse
retrieval narrows the candidate pool with high recall, while re-ranking ensures
precise alignment with nuanced textual modifications. Experiments on CIRR,
Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance,
improving Recall@1 on CIRR by up to 15.15 points. Our results establish
two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
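
The retrieve-then-re-rank flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify how the intersection-driven query is built, so `query_emb` is simply assumed to be given. `yes_prob` is a hypothetical stand-in for the LoRA-adapted multimodal LLM, and ordering candidates by a "Yes" score is an assumption, since the abstract only states that the model makes binary judgments. All function names and the toy data are illustrative.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def coarse_retrieve(query_emb, gallery_embs, k):
    """Stage 1: indices of the top-k gallery items by cosine similarity."""
    sims = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    k = min(k, len(sims))
    return np.argsort(-sims, kind="stable")[:k]


def rerank(candidates, yes_prob):
    """Stage 2: reorder candidates by a relevance score, highest first.

    yes_prob is a hypothetical callable mapping a gallery index to a score
    in [0, 1]; in the paper's setting a multimodal LLM would play this role.
    """
    scores = np.array([yes_prob(int(i)) for i in candidates])
    order = np.argsort(-scores, kind="stable")
    return candidates[order]


def recall_at_k(ranked, target, k):
    """1.0 if the target index appears in the first k results, else 0.0."""
    return float(target in ranked[:k])


# Toy data (illustrative only): four 2-D gallery embeddings and one query.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
target = 1  # assumed ground-truth gallery index

coarse = coarse_retrieve(query, gallery, k=2)

# Hypothetical re-ranker scores; a real system would query the LLM here.
fake_scores = {0: 0.2, 1: 0.9}
final = rerank(coarse, lambda i: fake_scores[i])

print("coarse:", coarse.tolist(), "R@1 =", recall_at_k(coarse, target, 1))
print("final: ", final.tolist(), "R@1 =", recall_at_k(final, target, 1))
```

In this toy case the coarse stage places the target second, and the re-ranker promotes it to first, which is the kind of Recall@1 gain the second stage is meant to deliver. The re-ranker can only reorder what the coarse stage returns, so a target missing from the candidate set cannot be recovered.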