SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
2510.15476v1
cs.CR, cs.AI
2025-10-21
Authors:
Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, Yuan Hong
Abstract
Large Language Models (LLMs) have rapidly become integral to real-world
applications, powering services across diverse sectors. However, their
widespread deployment has exposed critical security risks, particularly through
jailbreak prompts that can bypass model alignment and induce harmful outputs.
Despite intense research into both attack and defense techniques, the field
remains fragmented: definitions, threat models, and evaluation criteria vary
widely, impeding systematic progress and fair comparison. In this
Systematization of Knowledge (SoK), we address these challenges by (1)
proposing a holistic, multi-level taxonomy that organizes attacks, defenses,
and vulnerabilities in LLM prompt security; (2) formalizing threat models and
cost assumptions into machine-readable profiles for reproducible evaluation;
(3) introducing an open-source evaluation toolkit for standardized, auditable
comparison of attacks and defenses; (4) releasing JAILBREAKDB, the largest
annotated dataset of jailbreak and benign prompts to date; and (5) presenting a
comprehensive evaluation and leaderboard of state-of-the-art methods. Our work
unifies fragmented research, provides rigorous foundations for future studies,
and supports the development of robust, trustworthy LLMs suitable for
high-stakes deployment.