Dynamic Target Attack
2510.02422v1
cs.CR, cs.AI
2025-10-07
Authors:
Kedong Xiu, Churui Zeng, Tianhang Zheng, Xinzhe Huang, Xiaojun Jia, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
Abstract
Existing gradient-based jailbreak attacks typically optimize an adversarial
suffix to induce a fixed affirmative response. However, this fixed target
usually resides in an extremely low-density region of a safety-aligned LLM's
output distribution conditioned on diverse harmful inputs. Because of the
substantial discrepancy between this target and the model's original output,
existing attacks require numerous iterations to optimize the adversarial prompt,
and the optimized prompt may still fail to induce the low-probability target
response from the target LLM. In this paper, we propose Dynamic Target Attack (DTA), a new jailbreaking
framework relying on the target LLM's own responses as targets to optimize the
adversarial prompts. In each optimization round, DTA iteratively samples
multiple candidate responses directly from the output distribution conditioned
on the current prompt, and selects the most harmful response as a temporary
target for prompt optimization. In contrast to existing attacks, DTA
significantly reduces the discrepancy between the target and the output
distribution, substantially easing the optimization process to search for an
effective adversarial prompt.
Extensive experiments demonstrate the superior effectiveness and efficiency
of DTA: under the white-box setting, DTA needs only 200 optimization iterations
to achieve an average attack success rate (ASR) of over 87% on recent
safety-aligned LLMs, exceeding the state-of-the-art baselines by over 15%. The
time cost of DTA is 2 to 26 times lower than that of existing baselines. Under the
black-box setting, DTA uses Llama-3-8B-Instruct as a surrogate model for target
sampling and achieves an ASR of 85% against the black-box target model
Llama-3-70B-Instruct, exceeding its counterparts by over 25%.