Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
2510.26241v1
cs.CV, cs.CL
2025-11-01
Авторы:
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT)-whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
Ссылки и действия
Дополнительные ресурсы: