Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
2511.01588v1
cs.LG, cs.CV
2025-11-06
Авторы:
Zhicheng Wang, Chen Ju, Xu Chen, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, Ying Chen, Zhiguo Cao
Abstract
Embedding models are a cornerstone of modern AI. Driven by Multimodal Large
Language Models (MLLMs), they have made great progress in architecture and data
curation, while the holistic paradigm is still limited to SSC, i.e., single
input, singular embedding, contrastive supervision, which collapses rich,
multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM
capabilities. In this paper, we tailor one Parallel Decoupling Framework (PDF)
for multimodal embedding learning, by utilizing the proprietary steerability of
MLLMs, i.e., their ability to flexibly generate quite differentiated response
under explicit instructions. Concretely, PDF conditions a shared MLLM backbone
on distinct, learnable prefixes to roll out multiple parallel paths for one
input, then relies on these paths to obtain parallel embeddings. To promote
full parallel diversity, we employ Mutual Information Minimization (MIM) as an
explicit constraint, coupled with per-path contrastive supervision to maintain
semantic alignment. Such dual-objectives force PDF to yield robust semantic
coverage and a generalizable embedding space. Ultimately, the remarkable
embedding space are accessible at inference via one single forward pass,
incurring negligible computational overhead. We instantiate PDF on multiple
MLLM backbones and prove its effectiveness on MMEB benchmark. Significant gains
are consistently achieved across various resolutions and model sizes, e.g.,
boosting the VLM2Vec-LLaVA-1.6-LR model by a remarkable +8.9% (7B), while the
VLM2Vec-Qwen2VL models by +4.2% (2B) and +3.1% (7B). In terms of efficiency,
our 2B model surpasses its baseline by +2.6% using only half the computational
budget.
Ссылки и действия
Дополнительные ресурсы: