MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval
2510.15543v1
cs.CL, cs.AI, cs.IR, cs.MM
2025-10-21
Авторы:
Qiyu Wu, Shuyang Cui, Satoshi Hayakawa, Wei-Yao Wang, Hiromi Wakaki, Yuki Mitsufuji
Abstract
Multimodal retrieval, which seeks to retrieve relevant content across
modalities such as text or image, supports applications from AI search to
contents production. Despite the success of separate-encoder approaches like
CLIP align modality-specific embeddings with contrastive learning, recent
multimodal large language models (MLLMs) enable a unified encoder that directly
processes composed inputs. While flexible and advanced, we identify that
unified encoders trained with conventional contrastive learning are prone to
learn modality shortcut, leading to poor robustness under distribution shifts.
We propose a modality composition awareness framework to mitigate this issue.
Concretely, a preference loss enforces multimodal embeddings to outperform
their unimodal counterparts, while a composition regularization objective
aligns multimodal embeddings with prototypes composed from its unimodal parts.
These objectives explicitly model structural relationships between the composed
representation and its unimodal counterparts. Experiments on various benchmarks
show gains in out-of-distribution retrieval, highlighting modality composition
awareness as a effective principle for robust composed multimodal retrieval
when utilizing MLLMs as the unified encoder.