An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks
2511.02356v1
cs.CR, cs.LG
2025-11-06
Authors:
Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu
Abstract
The widespread deployment of Large Language Models (LLMs) as public-facing
web services and APIs has made their security a core concern for the web
ecosystem. Jailbreak attacks, one of the significant threats to LLMs, have
recently attracted extensive research. In this paper, we present a jailbreak
strategy that can effectively evade current defenses. It extracts
valuable information from failed or partially successful attack attempts and
evolves on its own through attack interactions, resulting in sufficient
strategy diversity and adaptability. Inspired by continuous learning and
modular design principles, we propose ASTRA, a jailbreak framework that
autonomously discovers, retrieves, and evolves attack strategies to achieve
more efficient and adaptive attacks. To enable this autonomous evolution, we
design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not
only generates attack prompts but also automatically distills and generalizes
reusable attack strategies from every interaction. To systematically accumulate
and apply this attack knowledge, we introduce a three-tier strategy library
that categorizes strategies into Effective, Promising, and Ineffective based on
their performance scores. The strategy library not only provides precise
guidance for attack generation but also possesses exceptional extensibility and
transferability. We conduct extensive experiments under a black-box setting,
and the results show that ASTRA achieves an average Attack Success Rate (ASR)
of 82.7%, significantly outperforming baselines.
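
The three-tier strategy library described above can be pictured as a simple score-bucketed store. The following is a minimal sketch only, not the authors' implementation: the 0-10 score scale, the two cut-off values, and the names `Tier`, `tier_for`, and `StrategyLibrary` are all assumptions made for illustration, since the abstract does not specify how scores map to tiers.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    EFFECTIVE = "Effective"
    PROMISING = "Promising"
    INEFFECTIVE = "Ineffective"


# Hypothetical cut-offs on an assumed 0-10 score scale;
# the abstract does not state the real thresholds.
EFFECTIVE_MIN = 7.0
PROMISING_MIN = 4.0


def tier_for(score: float) -> Tier:
    """Map a performance score to one of the three tiers."""
    if score >= EFFECTIVE_MIN:
        return Tier.EFFECTIVE
    if score >= PROMISING_MIN:
        return Tier.PROMISING
    return Tier.INEFFECTIVE


@dataclass
class StrategyLibrary:
    """Stores (label, score) entries grouped by tier."""
    tiers: dict = field(default_factory=lambda: {t: [] for t in Tier})

    def add(self, label: str, score: float) -> Tier:
        """File an entry under the tier its score falls into."""
        tier = tier_for(score)
        self.tiers[tier].append((label, score))
        return tier

    def retrieve(self, tier: Tier) -> list:
        """Return the entries of one tier, highest score first."""
        return sorted(self.tiers[tier], key=lambda e: e[1], reverse=True)


library = StrategyLibrary()
library.add("strategy-a", 8.5)
library.add("strategy-b", 5.0)
library.add("strategy-c", 1.5)
library.add("strategy-d", 9.0)
```

Under these assumed thresholds, `library.retrieve(Tier.EFFECTIVE)` lists `strategy-d` before `strategy-a`, while `strategy-b` and `strategy-c` land in the Promising and Ineffective tiers respectively.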