VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
2510.17759v1
cs.CR, cs.CL, cs.CV, cs.LG, stat.ML
2025-10-22
Авторы:
Qilin Liao, Anamika Lochab, Ruqi Zhang
Abstract
Vision-Language Models (VLMs) extend large language models with visual
reasoning, but their multimodal design also introduces new, underexplored
vulnerabilities. Existing multimodal red-teaming methods largely rely on
brittle templates, focus on single-attack settings, and expose only a narrow
subset of vulnerabilities. To address these limitations, we introduce VERA-V, a
variational inference framework that recasts multimodal jailbreak discovery as
learning a joint posterior distribution over paired text-image prompts. This
probabilistic view enables the generation of stealthy, coupled adversarial
inputs that bypass model guardrails. We train a lightweight attacker to
approximate the posterior, allowing efficient sampling of diverse jailbreaks
and providing distributional insights into vulnerabilities. VERA-V further
integrates three complementary strategies: (i) typography-based text prompts
that embed harmful cues, (ii) diffusion-based image synthesis that introduces
adversarial signals, and (iii) structured distractors to fragment VLM
attention. Experiments on HarmBench and HADES benchmarks show that VERA-V
consistently outperforms state-of-the-art baselines on both open-source and
frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the
best baseline on GPT-4o.