AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
2510.04257v1
cs.CR, cs.AI
2025-10-08
Авторы:
Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
Abstract
Multimodal agents built on large vision-language models (LVLMs) are
increasingly deployed in open-world settings but remain highly vulnerable to
prompt injection, especially through visual inputs. We introduce AgentTypo, a
black-box red-teaming framework that mounts adaptive typographic prompt
injection by embedding optimized text into webpage images. Our automatic
typographic prompt injection (ATPI) algorithm maximizes prompt reconstruction
by substituting captioners while minimizing human detectability via a stealth
loss, with a Tree-structured Parzen Estimator guiding black-box optimization
over text placement, size, and color. To further enhance attack strength, we
develop AgentTypo-pro, a multi-LLM system that iteratively refines injection
prompts using evaluation feedback and retrieves successful past examples for
continual learning. Effective prompts are abstracted into generalizable
strategies and stored in a strategy repository, enabling progressive knowledge
accumulation and reuse in future attacks. Experiments on the VWA-Adv benchmark
across Classifieds, Shopping, and Reddit scenarios show that AgentTypo
significantly outperforms the latest image-based attacks such as AgentAttack.
On GPT-4o agents, our image-only attack raises the success rate from 0.23 to
0.45, with consistent results across GPT-4V, GPT-4o-mini, Gemini 1.5 Pro, and
Claude 3 Opus. In image+text settings, AgentTypo achieves 0.68 ASR, also
outperforming the latest baselines. Our findings reveal that AgentTypo poses a
practical and potent threat to multimodal agents and highlight the urgent need
for effective defense.
Ссылки и действия
Дополнительные ресурсы: