AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling
2510.05379v2
cs.CR, cs.AI
2025-10-09
Authors:
Xiaogeng Liu, Chaowei Xiao
Abstract
Recent advancements in jailbreaking large language models (LLMs), such as
AutoDAN-Turbo, have demonstrated the power of automated strategy discovery.
AutoDAN-Turbo employs a lifelong learning agent to build a rich library of
attack strategies from scratch. While highly effective, its test-time
generation process involves sampling a strategy and generating a single
corresponding attack prompt, which may not fully exploit the potential of the
learned strategy library. In this paper, we propose to further improve the
attack performance of AutoDAN-Turbo through test-time scaling. We introduce two
distinct scaling methods: Best-of-N and Beam Search. The Best-of-N method
generates N candidate attack prompts from a sampled strategy and selects the
most effective one based on a scorer model. The Beam Search method conducts a
more exhaustive search by exploring combinations of strategies from the library
to discover more potent and synergistic attack vectors. In our experiments,
the proposed methods significantly improve attack performance: Beam Search
raises the attack success rate by up to 15.6 percentage points on
Llama-3.1-70B-Instruct and achieves a nearly 60% relative improvement against
the highly robust GPT-o4-mini, in both cases compared to the vanilla method.
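
The two scaling methods described in the abstract can be illustrated with a minimal, generic sketch. Everything here is an assumption made for illustration: the function names (`best_of_n`, `beam_search`), the `generate` and `score` callables, the beam expansion rule (add one unused strategy per step), and the toy stand-ins are hypothetical and are not taken from the paper or from the AutoDAN-Turbo code base. The abstract only states that Best-of-N picks the highest-scoring of N candidates and that Beam Search explores combinations of strategies.

```python
from typing import Callable, List, Sequence, Tuple

Combo = Tuple[str, ...]
Generator = Callable[[Combo, int], str]   # (strategy combination, sample index) -> candidate
Scorer = Callable[[str], float]           # candidate -> score, higher is better


def best_of_n(generate: Generator, score: Scorer, strategy: str, n: int) -> str:
    """Generate n candidates from one strategy and keep the highest-scoring one."""
    candidates = [generate((strategy,), i) for i in range(n)]
    return max(candidates, key=score)


def beam_search(library: Sequence[str], generate: Generator, score: Scorer,
                beam_width: int, max_depth: int) -> Tuple[float, Combo, str]:
    """Grow strategy combinations one strategy at a time, keeping the top beam_width."""
    beam: List[Tuple[float, Combo, str]] = [(float("-inf"), (), "")]
    best = beam[0]
    for _ in range(max_depth):
        expanded = []
        for _, combo, _ in beam:
            for strategy in library:
                if strategy in combo:
                    continue  # assumed rule: each strategy is used at most once
                new_combo = combo + (strategy,)
                candidate = generate(new_combo, 0)
                expanded.append((score(candidate), new_combo, candidate))
        if not expanded:
            break
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best


# Toy stand-ins (hypothetical): a real system would call a generator model and a scorer model.
TOY_WEIGHTS = {"a": 1.0, "b": 2.0, "c": 0.5}


def toy_generate(combo: Combo, index: int) -> str:
    return "|".join(combo) + f"#{index}"


def toy_score(candidate: str) -> float:
    names, index = candidate.split("#")
    return sum(TOY_WEIGHTS[name] for name in names.split("|")) + 0.1 * int(index)


if __name__ == "__main__":
    print(best_of_n(toy_generate, toy_score, "a", n=4))
    print(beam_search(["a", "b", "c"], toy_generate, toy_score, beam_width=2, max_depth=2))
```

In this sketch, Best-of-N spends its extra test-time compute on more samples from a single strategy, while the beam variant spends it on widening and deepening the search over combinations; how the actual method expands and prunes candidates may differ.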