Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
2510.22085v1
cs.CR, cs.AI, cs.CL, cs.LG, I.2.7; I.2.0; K.6.5
2025-10-29
Авторы:
Pavlos Ntais
Abstract
Large language models (LLMs) remain vulnerable to sophisticated prompt
engineering attacks that exploit contextual framing to bypass safety
mechanisms, posing significant risks in cybersecurity applications. We
introduce Jailbreak Mimicry, a systematic methodology for training compact
attacker models to automatically generate narrative-based jailbreak prompts in
a one-shot manner. Our approach transforms adversarial prompt discovery from
manual craftsmanship into a reproducible scientific process, enabling proactive
vulnerability assessment in AI-driven security systems. Developed for the
OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient
fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench,
achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out
test set of 200 items. Cross-model evaluation reveals significant variation in
vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% on
Llama-3 and 33.0% against Gemini 2.5 Flash, demonstrating both broad
applicability and model-specific defensive strengths in cybersecurity contexts.
This represents a 54x improvement over direct prompting (1.5% ASR) and
demonstrates systematic vulnerabilities in current safety alignment approaches.
Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and
deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable,
highlighting threats to AI-integrated threat detection, malware analysis, and
secure systems, while physical harm categories show greater resistance (55.6%
ASR). We employ automated harmfulness evaluation using Claude Sonnet 4,
cross-validated with human expert assessment, ensuring reliable and scalable
evaluation for cybersecurity red-teaming. Finally, we analyze failure
mechanisms and discuss defensive strategies to mitigate these vulnerabilities
in AI for cybersecurity.