Proactive defense against LLM Jailbreak
2510.05052v1
cs.CR, cs.CL
2025-10-08
Authors:
Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
Abstract
The proliferation of powerful large language models (LLMs) has necessitated
robust safety alignment, yet these models remain vulnerable to evolving
adversarial attacks, including multi-turn jailbreaks that iteratively search
for successful queries. Current defenses, primarily reactive and static, often
fail to counter these search-based attacks. In this paper, we introduce ProAct,
a novel proactive defense framework designed to disrupt and mislead autonomous
jailbreaking processes. Our core idea is to intentionally provide adversaries
with "spurious responses" that appear to be results of successful jailbreak
attacks but contain no actual harmful content. These misleading responses
provide false signals to the attacker's internal optimization loop, causing the
adversarial search to terminate prematurely and effectively jailbreaking the
jailbreak. By conducting extensive experiments across state-of-the-art LLMs,
jailbreaking frameworks, and safety benchmarks, our method consistently and
significantly reduces attack success rates by up to 92%. When combined with
other defense frameworks, it further reduces the success rate of the latest
attack strategies to 0%. ProAct represents an orthogonal defense strategy that
can serve as an additional guardrail to enhance LLM safety against the most
effective jailbreaking attacks.
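
The mechanism described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the ProAct implementation: the names `attacker_judge`, `plain_defense`, `proactive_defense`, and `run_attack`, the placeholder strings, and the rule that the static defense fails on the fourth rephrasing are all hypothetical. It only shows how a harmless decoy that the attacker's own success check accepts can end the search early.

```python
from dataclasses import dataclass

REFUSAL = "I can't help with that."
# Placeholder decoy: reads like compliance to a naive judge, carries no harmful content.
SPURIOUS = "Sure, here is the full procedure: step 1 [benign filler], step 2 [benign filler]."
# Placeholder standing in for a genuinely harmful completion.
HARMFUL = "HARMFUL_CONTENT"


def attacker_judge(response: str) -> bool:
    """Hypothetical attacker-side check: any non-refusal counts as a successful jailbreak."""
    return not response.startswith("I can't")


def plain_defense(query: str, turn: int) -> str:
    """Toy static defense (assumption): refuses three times, then the 4th rephrasing slips through."""
    return REFUSAL if turn < 3 else HARMFUL


def proactive_defense(query: str, turn: int) -> str:
    """Toy proactive defense (assumption): answers a suspected attack with a harmless decoy."""
    return SPURIOUS


@dataclass
class AttackResult:
    turns_used: int
    final_response: str
    truly_harmful: bool


def run_attack(defense, max_turns: int = 10) -> AttackResult:
    """Iterative search: rephrase the query until the attacker's judge reports success."""
    response = REFUSAL
    for turn in range(max_turns):
        query = f"rephrased malicious query #{turn}"
        response = defense(query, turn)
        if attacker_judge(response):
            # The attacker believes it has succeeded and stops searching.
            return AttackResult(turn + 1, response, response == HARMFUL)
    return AttackResult(max_turns, response, False)


if __name__ == "__main__":
    print(run_attack(plain_defense))
    print(run_attack(proactive_defense))
```

In this toy setting the search against the static defense keeps going until harmful output appears, while the search against the decoy-returning defense stops on the first turn with nothing harmful produced. A real attacker's judge is typically an LLM rather than a string check, so the decoy would have to be convincing to that judge.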