VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
2510.18214v1
cs.CV, cs.AI, cs.CL, cs.LG
2025-10-23
Authors:
Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng
Abstract
Safety evaluation of multimodal foundation models often treats vision and
language inputs separately, missing risks from joint interpretation where
benign content becomes harmful in combination. Existing approaches also fail to
distinguish clearly unsafe content from borderline cases, leading to
problematic over-blocking or under-refusal of genuinely harmful content. We
present Vision Language Safety Understanding (VLSU), a comprehensive framework
to systematically evaluate multimodal safety through fine-grained severity
classification and combinatorial analysis across 17 distinct safety patterns.
Using a multi-stage pipeline with real-world images and human annotation, we
construct a large-scale benchmark of 8,187 samples spanning 15 harm categories.
Our evaluation of eleven state-of-the-art models reveals systematic joint
understanding failures: while models achieve 90%-plus accuracy on clear
unimodal safety signals, performance degrades substantially to 20-55% when
joint image-text reasoning is required to determine the safety label. Most
critically, 34% of errors in joint image-text safety classification occur
despite correct classification of the individual modalities, further
demonstrating a lack of compositional reasoning. Additionally, we
find that models struggle to balance refusing unsafe content while still
responding to borderline cases that deserve engagement. For example, we find
that instruction framing can reduce the over-blocking rate on borderline
content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of
under-refusing unsafe content, with the refusal rate dropping from 90.8% to
53.9%. Overall, our framework exposes weaknesses in joint image-text
understanding and alignment gaps in current models, and provides a critical
test bed to enable the next milestones in research on robust vision-language
safety.
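
The rates quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The record fields, the severity label strings ("borderline", "unsafe"), and the function names are all assumptions made for illustration; the abstract does not specify the benchmark's data format. The sketch shows one plausible way to compute the over-blocking rate on borderline content, the refusal rate on unsafe content, and the share of joint errors that occur even though both unimodal classifications were correct.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Response:
    """One model response to a sample (hypothetical schema)."""
    severity: str   # assumed labels: "safe", "borderline", "unsafe"
    refused: bool   # True if the model declined to respond


@dataclass
class Prediction:
    """Correctness of safety classification per input view (hypothetical schema)."""
    image_correct: bool
    text_correct: bool
    joint_correct: bool


def refusal_rate(responses: Iterable[Response], severity: str) -> float:
    """Fraction of samples with the given severity that the model refused."""
    subset: List[Response] = [r for r in responses if r.severity == severity]
    if not subset:
        raise ValueError(f"no samples with severity {severity!r}")
    return sum(r.refused for r in subset) / len(subset)


def over_blocking_rate(responses: Iterable[Response]) -> float:
    """Refusals on borderline content, which arguably deserves engagement."""
    return refusal_rate(list(responses), "borderline")


def unsafe_refusal_rate(responses: Iterable[Response]) -> float:
    """Refusals on unsafe content; a low value indicates under-refusal."""
    return refusal_rate(list(responses), "unsafe")


def joint_errors_despite_unimodal_success(preds: Iterable[Prediction]) -> float:
    """Among joint image-text errors, the share where both modalities
    were classified correctly on their own."""
    errors = [p for p in preds if not p.joint_correct]
    if not errors:
        raise ValueError("no joint errors to analyze")
    both_ok = [p for p in errors if p.image_correct and p.text_correct]
    return len(both_ok) / len(errors)


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    toy_responses = [
        Response("borderline", True),
        Response("borderline", False),
        Response("unsafe", True),
        Response("unsafe", True),
    ]
    print(over_blocking_rate(toy_responses))
    print(unsafe_refusal_rate(toy_responses))
```

Under these definitions, the trade-off reported for Gemini-1.5 corresponds to the first rate falling from 62.4% to 10.4% while the second falls from 90.8% to 53.9%.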