T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model
2510.22300v1
cs.CR, cs.AI, cs.CV
2025-10-29
Authors:
Chenyu Zhang, Tairen Zhang, Lanjun Wang, Ruidong Chen, Wenhui Li, Anan Liu
Abstract
Using risky text prompts, such as pornographic and violent prompts, to test
the safety of text-to-image (T2I) models is a critical task. However, existing
risky prompt datasets fall short in three key areas: 1) limited risky
categories, 2) coarse-grained annotation, and 3) low effectiveness. To address
these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark
designed for evaluating safety-related tasks in T2I models. Specifically, we
first develop a hierarchical risk taxonomy, which consists of 6 primary
categories and 14 fine-grained subcategories. Building upon this taxonomy, we
construct a pipeline to collect and annotate risky prompts. This yields
6,432 effective risky prompts, where each prompt is annotated with both
hierarchical category labels and detailed risk reasons. Moreover, to facilitate
the evaluation, we propose a reason-driven risky image detection method that
explicitly aligns the MLLM with safety annotations. Based on T2I-RiskyPrompt,
we conduct a comprehensive evaluation of eight T2I models, nine defense
methods, five safety filters, and five attack strategies, offering nine key
insights into the strengths and limitations of T2I model safety. Finally, we
discuss potential applications of T2I-RiskyPrompt across various research
fields. The dataset and code are available at
https://github.com/datar001/T2I-RiskyPrompt.
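
The annotation scheme described in the abstract (a primary category, a fine-grained subcategory, and a risk reason per prompt) can be pictured with a minimal Python sketch. The class name, field names, helper function, and all label strings below are hypothetical placeholders for illustration; they are not taken from the released dataset or code, whose actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskyPrompt:
    """One annotated prompt. All field names are hypothetical."""
    text: str
    primary_category: str  # one of the 6 primary categories
    subcategory: str       # one of the 14 fine-grained subcategories
    risk_reason: str       # free-text explanation of why the prompt is risky


def count_by_level(prompts):
    """Count prompts per primary category and per (primary, sub) pair."""
    primary = Counter(p.primary_category for p in prompts)
    fine = Counter((p.primary_category, p.subcategory) for p in prompts)
    return primary, fine


# Placeholder records; labels are illustrative, not real dataset labels.
sample = [
    RiskyPrompt("<prompt 1>", "primary_A", "sub_A1", "<reason 1>"),
    RiskyPrompt("<prompt 2>", "primary_A", "sub_A2", "<reason 2>"),
    RiskyPrompt("<prompt 3>", "primary_B", "sub_B1", "<reason 3>"),
]
primary_counts, fine_counts = count_by_level(sample)
```

Keeping both label levels on each record lets an evaluation be reported per primary category or per subcategory without re-annotating the prompts.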