Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning
2510.01502v1
q-bio.NC, cs.CV, cs.LG
2025-10-04
Authors:
Kathy Garcia, Leyla Isik
Abstract
Humans intuitively perceive complex social signals in visual scenes, yet it
remains unclear whether state-of-the-art AI models encode the same similarity
structure. We study (Q1) whether modern video and language models capture
human-perceived similarity in social videos, and (Q2) how to instill this
structure into models using human behavioral data. To address these questions, we
introduce a new benchmark of over 49,000 odd-one-out similarity judgments on
250 three-second video clips of social interactions, and discover a modality
gap: despite the task being visual, caption-based language embeddings align
better with human similarity than any pretrained video model. We close this gap
by fine-tuning a TimeSformer video model on these human judgments with our
novel hybrid triplet-RSA objective using low-rank adaptation (LoRA), aligning
pairwise distances to human similarity. This fine-tuning protocol yields
significantly improved alignment with human perceptions on held-out videos in
terms of both explained variance and odd-one-out triplet accuracy. Variance
partitioning shows that the fine-tuned video model increases shared variance
with language embeddings and explains additional unique variance not captured
by the language model. Finally, we test transfer via linear probes and find
that human-similarity fine-tuning strengthens the encoding of social-affective
attributes (intimacy, valence, dominance, communication) relative to the
pretrained baseline. Overall, our findings highlight a gap in pretrained video
models' social recognition and demonstrate that behavior-guided fine-tuning
shapes video representations toward human social perception.
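
The two evaluation measures named in the abstract, odd-one-out triplet accuracy and agreement between model and human pairwise similarity, can be illustrated with a minimal sketch. The abstract does not specify the exact metric definitions, so everything here is an assumption for illustration: cosine similarity as the model's similarity measure, Pearson correlation as the representational-similarity score, and the toy embeddings, judgments, and `human_sim` values, which are hypothetical and not taken from the benchmark.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_odd_one_out(emb, triplet):
    """Pick the item left out of the most similar pair in the triplet."""
    i, j, k = triplet
    # Each candidate odd item is scored by the similarity of the other two.
    candidates = [
        (cosine(emb[j], emb[k]), i),
        (cosine(emb[i], emb[k]), j),
        (cosine(emb[i], emb[j]), k),
    ]
    return max(candidates)[1]


def triplet_accuracy(emb, judgments):
    """Fraction of human odd-one-out choices that the embeddings reproduce."""
    hits = sum(predict_odd_one_out(emb, t) == odd for t, odd in judgments)
    return hits / len(judgments)


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(emb, human_sim):
    """Correlate model and human similarity over all unique video pairs."""
    pairs = list(combinations(sorted(emb), 2))
    model = [cosine(emb[a], emb[b]) for a, b in pairs]
    human = [human_sim[(a, b)] for a, b in pairs]
    return pearson(model, human)


# Hypothetical toy data: four "videos" with 2-D embeddings.
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}

# Hypothetical human judgments: (triplet, item chosen as the odd one out).
judgments = [((0, 1, 2), 2), ((0, 2, 3), 0), ((1, 2, 3), 1), ((0, 1, 3), 0)]

# Hypothetical human pairwise similarities, keyed by sorted video pair.
human_sim = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
             (1, 2): 0.2, (1, 3): 0.3, (2, 3): 0.9}

accuracy = triplet_accuracy(emb, judgments)
alignment = rsa_score(emb, human_sim)
```

In this toy setup, the last judgment disagrees with the embeddings, so the accuracy falls below 1 even though the pairwise similarities correlate strongly. This shows why the two measures can be reported separately.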