The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
2510.08482v2
cs.CV, cs.CL
2025-10-14
Authors:
Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb
Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive
in signed languages, offering a natural testbed for visual grounding. For
vision-language models (VLMs), the challenge is to recover such essential
mappings from dynamic human motion rather than static context. We introduce the
Visual Iconicity Challenge, a novel video-based benchmark that adapts
psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological
sign-form prediction (e.g., handshape, location), (ii) transparency (inferring
meaning from visual form), and (iii) graded iconicity ratings. We assess 13
state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the
Netherlands and compare them to human baselines. On phonological form
prediction, VLMs recover some handshape and location detail but remain below
human performance; on transparency, they fall far short of human baselines; and only
the top models correlate moderately with human iconicity ratings. Interestingly,
models with stronger phonological form prediction correlate better with human
iconicity judgment, indicating shared sensitivity to visually grounded
structure. Our findings validate these diagnostic tasks and motivate
human-centric signals and embodied learning methods for modelling iconicity and
improving visual grounding in multimodal models.