Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
2510.06820v1
cs.CV, cs.LG
2025-10-10
Авторы:
Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin
Abstract
Multimodal retrieval still leans on embedding-based models like CLIP for fast
vector search over pre-computed image embeddings. Yet, unlike text retrieval,
where joint-encoder rerankers are standard, comparable vision--language
rerankers are largely absent. We find that seminal joint encoders such as BLIP
are severely bottlenecked by an expensive visual feature-extraction stage,
preventing practical deployment at scale. Motivated by this bottleneck, we
introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes
vision tokens offline and compresses them via a lightweight attention-based
adapter, so online inference runs only a compact joint encoder over a small set
of visual tokens plus the text. EDJE preserves strong retrieval performance
while drastically reducing storage and online compute, enabling high-throughput
inference. Specifically, EDJE processes 50k image--text pairs/second while
requiring 49kB of disk storage per image, matching prior art on Flickr
(zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints
will be made publicly available shortly.
Ссылки и действия
Дополнительные ресурсы: