The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
2510.08482v1
cs.CV, cs.CL
2025-10-11
Авторы:
Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgö, Esam Ghaleb
Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive
in signed languages, offering a natural testbed for visual grounding. For
vision-language models (VLMs), the challenge is to recover such essential
mappings from dynamic human motion rather than static context. We introduce the
\textit{Visual Iconicity Challenge}, a novel video-based benchmark that adapts
psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological
sign-form prediction (e.g., handshape, location), (ii) transparency (inferring
meaning from visual form), and (iii) graded iconicity ratings. We assess $13$
state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the
Netherlands and compare them to human baselines. On \textit{phonological form
prediction}, VLMs recover some handshape and location detail but remain below
human performance; on \textit{transparency}, they are far from human baselines;
and only top models correlate moderately with human \textit{iconicity ratings}.
Interestingly, \textit{models with stronger phonological form prediction
correlate better with human iconicity judgment}, indicating shared sensitivity
to visually grounded structure. Our findings validate these diagnostic tasks
and motivate human-centric signals and embodied learning methods for modelling
iconicity and improving visual grounding in multimodal models.
Ссылки и действия
Дополнительные ресурсы: