Data or Language Supervision: What Makes CLIP Better than DINO?
2510.11835v1
cs.CV, cs.AI, cs.CL, cs.LG, cs.MM
2025-10-16
Авторы:
Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy
Abstract
CLIP outperforms self-supervised models like DINO as vision encoders for
vision-language models (VLMs), but it remains unclear whether this advantage
stems from CLIP's language supervision or its much larger training data. To
disentangle these factors, we pre-train CLIP and DINO under controlled settings
-- using the same architecture, dataset, and training configuration --
achieving similar ImageNet accuracy. Embedding analysis shows that CLIP
captures high-level semantics (e.g., object categories, text), while DINO is
more responsive to low-level features like colors and styles. When integrated
into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive
tasks, while DINO slightly outperforms on vision-centric ones. Variants of
language supervision (e.g., sigmoid loss, pre-trained language encoders) yield
limited gains. Our findings provide scientific insights into vision encoder
design and its impact on VLM performance.