Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
2510.07632v1
cs.AI, cs.CL, cs.CV, cs.LG
2025-10-11
Authors:
Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang
Abstract
Frontier AI models have achieved remarkable progress, yet recent studies
suggest they struggle with compositional reasoning, often performing at or
below random chance on established benchmarks. We revisit this problem and show
that widely used evaluation metrics systematically underestimate model
capability. To address this, we introduce a group matching score that better
exploits group structure and reveals substantial hidden capability in both
contrastive vision-language models (VLMs) and multimodal large language models
(MLLMs). Moreover, simply overfitting to the induced group matchings at test
time transfers this hidden capability into higher scores under standard
evaluation metrics, closing much of the reported gap. This adjustment enables
SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first
result surpassing estimated human performance on Winoground.
Building on this insight, we propose Test-Time Matching (TTM), an iterative,
self-improving algorithm that further bootstraps model performance without any
external supervision. TTM delivers additional, non-trivial improvements: for
example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a
new state of the art. Importantly, TTM remains broadly effective even on
benchmarks without metric-induced effects or group structures, achieving
relative gains of up to 85.7% on challenging datasets such as WhatsUp. Across 16
dataset variants spanning diverse setups, our experiments demonstrate that TTM
consistently improves model performance and advances the frontier of
compositional reasoning.
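
The abstract does not spell out how the group matching score is computed, so the following is only an illustrative sketch under stated assumptions. It assumes a k x k similarity matrix where sim[i][j] scores image i against caption j and the correct pairing is the diagonal. The function names pairwise_group_score and matching_group_score are hypothetical: the first mimics a Winoground-style group score that requires every pairwise comparison to succeed, and the second assumes a matching-based score that only requires the correct overall assignment to have the highest total similarity.

```python
from itertools import permutations


def pairwise_group_score(sim):
    """Winoground-style check (assumed form): the diagonal entry must beat
    every other entry in its row and in its column."""
    k = len(sim)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            # Image i must prefer caption i, and caption i must prefer image i.
            if sim[i][i] <= sim[i][j] or sim[i][i] <= sim[j][i]:
                return False
    return True


def matching_group_score(sim):
    """Matching-based check (assumed form): the identity assignment must have
    a strictly higher total similarity than any other one-to-one assignment."""
    k = len(sim)
    identity = tuple(range(k))
    identity_total = sum(sim[i][i] for i in range(k))
    for perm in permutations(range(k)):
        if perm == identity:
            continue
        if sum(sim[i][perm[i]] for i in range(k)) >= identity_total:
            return False
    return True


# Hypothetical similarities: image 0 scores high against both captions,
# so one pairwise comparison fails even though the best matching is correct.
example_sim = [
    [0.90, 0.95],
    [0.10, 0.30],
]

if __name__ == "__main__":
    print("pairwise:", pairwise_group_score(example_sim))
    print("matching:", matching_group_score(example_sim))
```

In this made-up example the pairwise check fails because 0.90 is not greater than 0.95, while the diagonal total (0.90 + 0.30) still exceeds the swapped total (0.95 + 0.10). This is one way a stricter metric could understate what a model's scores already encode. The brute-force enumeration is only practical for the small groups typical of such benchmarks.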