AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

arXiv:2511.02376v1 [cs.CL, cs.AI, cs.CR, cs.LG], 2025-11-06

Authors:

Aashray Reddy, Andrew Zagula, Nicholas Saban

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreaking attacks in which adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions, while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to a 95% attack success rate on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
