Topological Alignment of Shared Vision-Language Embedding Space
2510.10889v1
cs.CV, cs.AI, cs.LG
2025-10-15
Authors:
Junwon You, Dasol Kang, Jae-Hun Jung
Abstract
Contrastive Vision-Language Models (VLMs) have demonstrated strong zero-shot
capabilities. However, their cross-modal alignment remains biased toward
English due to limited multilingual multimodal data. Recent multilingual
extensions have alleviated this gap but enforce instance-level alignment while
neglecting the global geometry of the shared embedding space. We address this
problem by introducing ToMCLIP (Topological Alignment for Multilingual CLIP), a
topology-aware framework aligning embedding spaces with topology-preserving
constraints. The proposed method applies persistent homology to define a
topological alignment loss and approximates the persistence diagram, with
theoretical error bounds, using a graph sparsification strategy. This work
validates the proposed approach, showing enhanced structural coherence of
multilingual representations, higher zero-shot accuracy on CIFAR-100, and
stronger multilingual retrieval performance on xFlickr&CO. Beyond VLMs, the
proposed approach provides a general method for incorporating topological
alignment into representation learning.
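
To make the idea of a persistence-based alignment term concrete, here is a minimal sketch. It is not the ToMCLIP loss: the abstract does not specify the loss or the sparsification scheme. The sketch relies on a standard fact: for a Vietoris-Rips filtration, the death times of the 0-dimensional persistence diagram equal the edge weights of a minimum spanning tree on the pairwise distances. The function names `h0_death_times` and `h0_topology_gap`, and the choice of a mean squared difference, are assumptions made for illustration only.

```python
import math


def h0_death_times(points):
    """Sorted death times of the 0-dimensional Vietoris-Rips persistence diagram.

    Computed as the edge weights of a Euclidean minimum spanning tree
    (Prim's algorithm). Hypothetical helper, not the authors' code.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[i] = cheapest known distance from vertex i to the current tree
    best = [math.dist(points[0], p) for p in points]
    deaths = []
    for _ in range(n - 1):
        # pick the closest vertex that is not yet in the tree
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        deaths.append(best[j])  # two components merge at this scale
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[j], points[i]))
    return sorted(deaths)


def h0_topology_gap(emb_a, emb_b):
    """Mean squared difference between the sorted H0 death times of two
    equally sized point clouds (an illustrative discrepancy, assumed here)."""
    da, db = h0_death_times(emb_a), h0_death_times(emb_b)
    if len(da) != len(db):
        raise ValueError("point clouds must have the same number of points")
    if not da:
        return 0.0
    return sum((x - y) ** 2 for x, y in zip(da, db)) / len(da)
```

In this sketch, `emb_a` and `emb_b` would be batches of embeddings from two encoders (for example, text embeddings in two languages). The gap is zero when both clouds share the same connectivity scales, regardless of translation or rotation.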