Don't Walk the Line: Boundary Guidance for Filtered Generation
2510.11834v1
cs.LG, cs.CL
2025-10-16
Authors:
Sarah Ball, Andreas Haupt
Abstract
Generative models are increasingly paired with safety classifiers that filter
harmful or undesirable outputs. A common strategy is to fine-tune the generator
to reduce the probability of being filtered, but this can be suboptimal: it
often pushes the model toward producing samples near the classifier's decision
boundary, increasing both false positives and false negatives. We propose
Boundary Guidance, a reinforcement learning fine-tuning method that explicitly
steers generation away from the classifier's margin. On a benchmark of
jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and
the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive
ablations across model scales and reward designs demonstrate the robustness of
our approach.