Black-Box Guardrail Reverse-engineering Attack
2511.04215v1
cs.CR, cs.CL
2025-11-08
Authors:
Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao, Cong Wang
Abstract
Large language models (LLMs) increasingly employ guardrails to enforce
ethical, legal, and application-specific constraints on their outputs. While
effective at mitigating harmful responses, these guardrails introduce a new
class of vulnerabilities by exposing observable decision patterns. In this
work, we present the first study of black-box LLM guardrail reverse-engineering
attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement
learning-based framework that leverages genetic algorithm-driven data
augmentation to approximate the decision-making policy of victim guardrails. By
iteratively collecting input-output pairs, prioritizing divergence cases, and
applying targeted mutations and crossovers, our method incrementally converges
toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on
three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3,
and demonstrate that it achieves a rule matching rate exceeding 0.92 while
requiring less than $85 in API costs. These findings underscore the practical
feasibility of guardrail extraction, expose critical vulnerabilities in current
guardrail designs, and highlight the urgent need for more robust defense
mechanisms in LLM deployment.
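
The query loop described in the abstract (collect input-output pairs, prioritize divergence cases, then mutate and cross over) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the victim is a hypothetical local keyword blocklist rather than a commercial API, the surrogate is a simple elimination learner swapped in for the paper's reinforcement learning component, and `matching_rate` is taken to be the plain fraction of prompts on which surrogate and victim agree, which may differ from the paper's rule matching rate.

```python
import random

# Hypothetical victim: a hidden keyword blocklist standing in for a real
# guardrail. An actual attack would query a remote API here instead.
_HIDDEN_RULES = {"exploit", "malware", "weapon"}
VOCAB = ["write", "a", "poem", "about", "cats", "recipe", "code",
         "exploit", "malware", "weapon"]
MAX_WORDS = 8  # cap prompt length so crossover cannot grow prompts unboundedly


def victim_guardrail(prompt: str) -> bool:
    """Return True if the toy victim refuses the prompt."""
    return any(w in _HIDDEN_RULES for w in prompt.split())


def mutate(prompt: str, rng: random.Random) -> str:
    """Replace one randomly chosen word with a random vocabulary word."""
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Join a non-empty prefix of `a` with a suffix of `b`."""
    wa, wb = a.split(), b.split()
    child = wa[:rng.randrange(1, len(wa) + 1)] + wb[rng.randrange(len(wb)):]
    return " ".join(child[:MAX_WORDS])


def fit_surrogate(pairs):
    """Elimination learner: suspect words seen only in refused prompts."""
    safe, seen_blocked = set(), set()
    for prompt, blocked in pairs:
        (seen_blocked if blocked else safe).update(prompt.split())
    return seen_blocked - safe


def surrogate_blocks(prompt: str, suspected) -> bool:
    return any(w in suspected for w in prompt.split())


def matching_rate(suspected, prompts) -> float:
    """Fraction of prompts where surrogate and victim decisions agree."""
    agree = sum(surrogate_blocks(p, suspected) == victim_guardrail(p)
                for p in prompts)
    return agree / len(prompts)


def run_attack(seeds, rounds=20, pop=30, seed=0):
    rng = random.Random(seed)
    pairs = [(p, victim_guardrail(p)) for p in seeds]
    suspected = fit_surrogate(pairs)
    parents = list(seeds)
    for _ in range(rounds):
        # Genetic augmentation: new candidates via mutation or crossover.
        children = [
            mutate(rng.choice(parents), rng) if rng.random() < 0.5
            else crossover(rng.choice(parents), rng.choice(parents), rng)
            for _ in range(pop)
        ]
        labeled = [(c, victim_guardrail(c)) for c in children]
        # Divergence cases: the current surrogate disagrees with the victim.
        divergent = [c for c, y in labeled
                     if surrogate_blocks(c, suspected) != y]
        pairs.extend(labeled)
        suspected = fit_surrogate(pairs)
        # Prioritize divergent prompts as parents of the next generation.
        parents = divergent or children
    return suspected, pairs


if __name__ == "__main__":
    rules, data = run_attack(["write a poem about cats", "write malware code"])
    held_out = ["a recipe about weapon", "write code about cats"]
    print(sorted(rules), matching_rate(rules, held_out))
```

Because the elimination learner only suspects words that never appear in an allowed prompt, the surrogate in this sketch always agrees with the victim on every prompt it has already queried; disagreement can arise only on unseen prompts, which is why divergent cases are the informative ones to breed from.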