Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
2510.08859v1
cs.CL, cs.AI, cs.CR
2025-10-14
Authors:
Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
Abstract
Large language models (LLMs) remain vulnerable to multi-turn jailbreaking
attacks that exploit conversational context to bypass safety constraints
gradually. These attacks target different harm categories (like malware
generation, harassment, or fraud) through distinct conversational approaches
(educational discussions, personal experiences, hypothetical scenarios).
Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc
exploration strategies, providing limited insight into underlying model
weaknesses. The relationship between conversation patterns and model
vulnerabilities across harm categories remains poorly understood. We propose
Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation
patterns to construct effective multi-turn jailbreaks through natural dialogue.
Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve
state-of-the-art performance, uncovering pattern-specific vulnerabilities and
LLM behavioral characteristics: models exhibit distinct weakness profiles where
robustness to one conversational pattern does not generalize to others, and
model families share similar failure modes. These findings highlight
limitations of safety training and indicate the need for pattern-aware
defenses. Code is available at: https://github.com/Ragib-Amin-Nihal/PE-CoA