RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
2510.04885v1
cs.CR, cs.LG
2025-10-08
Authors:
Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo
Abstract
Prompt injection poses a serious threat to the reliability and safety of LLM
agents. Recent defenses against prompt injection, such as Instruction Hierarchy
and SecAlign, have shown notable robustness against static attacks. However, to
more thoroughly evaluate the robustness of these defenses, it is arguably
necessary to employ strong attacks such as automated red-teaming. To this end,
we introduce RL-Hammer, a simple recipe for training attacker models that
automatically learn to perform strong prompt injections and jailbreaks via
reinforcement learning. RL-Hammer requires no warm-up data and can be trained
entirely from scratch. To achieve high attack success rates (ASRs) against industrial-level models
with defenses, we propose a set of practical techniques that enable highly
effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR
against GPT-4o and a 72% ASR against GPT-5 with the Instruction Hierarchy
defense. We further discuss the challenge of achieving high diversity in
attacks, highlighting how attacker models tend to reward-hack diversity
objectives. Finally, we show that RL-Hammer can evade multiple prompt injection
detectors. We hope our work advances automatic red-teaming and motivates the
development of stronger, more principled defenses. Code is available at
https://github.com/facebookresearch/rl-injector.