Authors:
Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy
Abstract:
CLIP outperforms self-supervised models like DINO as vision encoders for
vision-language models (VLMs), but it remains unclear whether this advantage
stems from CLIP's language supervision or its much larger training data. To
disentangle these factors, we pre-train CLIP and DINO under controlled settings
-- using the same architecture, dataset, and training configuration --
achieving similar ImageNet accuracy. Embedding analysis shows that CLIP
captures high-level semantics (e.g., object categor...
Authors:
Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
Abstract:
Video Foundation Models (VFMs) exhibit remarkable visual generation
performance, but struggle in compositional scenarios (e.g., motion, numeracy,
and spatial relation). In this work, we introduce Test-Time Optimization and
Memorization (TTOM), a training-free framework that aligns VFM outputs with
spatiotemporal layouts during inference for better text-image alignment. Rather
than direct intervention to latents or attention per-sample in existing work,
we integrate and optimize new parameters gu...