Image Hashing via Cross-View Code Alignment in the Age of Foundation Models
2510.27584v1
cs.CV, cs.IR, cs.LG
2025-11-04
Авторы:
Ilyass Moummad, Kawtar Zaher, Hervé Goëau, Alexis Joly
Abstract
Efficient large-scale retrieval requires representations that are both
compact and discriminative. Foundation models provide powerful visual and
multimodal embeddings, but nearest neighbor search in these high-dimensional
spaces is computationally expensive. Hashing offers an efficient alternative by
enabling fast Hamming distance search with binary codes, yet existing
approaches often rely on complex pipelines, multi-term objectives, designs
specialized for a single learning paradigm, and long training times. We
introduce CroVCA (Cross-View Code Alignment), a simple and unified principle
for learning binary codes that remain consistent across semantically aligned
views. A single binary cross-entropy loss enforces alignment, while coding-rate
maximization serves as an anti-collapse regularizer to promote balanced and
diverse codes. To implement this, we design HashCoder, a lightweight MLP
hashing network with a final batch normalization layer to enforce balanced
codes. HashCoder can be used as a probing head on frozen embeddings or to adapt
encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves
state-of-the-art results in just 5 training epochs. At 16 bits, it particularly
well-for instance, unsupervised hashing on COCO completes in under 2 minutes
and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These
results highlight CroVCA's efficiency, adaptability, and broad applicability.
Ссылки и действия
Дополнительные ресурсы: