AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
2511.02376v1
cs.CL, cs.AI, cs.CR, cs.LG
2025-11-06
Авторы:
Aashray Reddy, Andrew Zagula, Nicholas Saban
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where
adversarial prompts elicit harmful outputs, yet most evaluations focus on
single-turn interactions while real-world attacks unfold through adaptive
multi-turn conversations. We present AutoAdv, a training-free framework for
automated multi-turn jailbreaking that achieves up to 95% attack success rate
on Llama-3.1-8B within six turns a 24 percent improvement over single turn
baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern
manager that learns from successful attacks to enhance future prompts, a
temperature manager that dynamically adjusts sampling parameters based on
failure modes, and a two-phase rewriting strategy that disguises harmful
requests then iteratively refines them. Extensive evaluation across commercial
and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent
vulnerabilities in current safety mechanisms, with multi-turn attacks
consistently outperforming single-turn approaches. These findings demonstrate
that alignment strategies optimized for single-turn interactions fail to
maintain robustness across extended conversations, highlighting an urgent need
for multi-turn-aware defenses.