Energy-Driven Steering: Reducing False Refusals in Large Language Models
2510.08646v1
cs.LG, cs.AI, cs.CL, stat.ML
2025-10-14
Authors:
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
Abstract
Safety alignment of large language models (LLMs) faces a key challenge:
current alignment techniques often focus only on improving safety against
harmful prompts, causing LLMs to become over-cautious and refuse to respond to
benign prompts. Therefore, a key objective of safety alignment is to enhance
safety while simultaneously reducing false refusals. In this paper, we
introduce Energy-Driven Steering (EDS), a novel, fine-tuning-free framework
designed to resolve this challenge through dynamic, inference-time
intervention. We train a lightweight, external Energy-Based Model (EBM) to
assign high energy to undesirable (false refusal or jailbreak) states and low
energy to desirable (helpful response or safe reject) ones. During inference,
the EBM maps the LLM's internal activations to an "energy landscape". We use the
gradient of the energy function to dynamically steer the LLM's hidden states toward
low-energy regions, correcting the model to generate a desirable response in
real time without modifying its weights. This method decouples behavioral
control from the model's core knowledge, offering a flexible solution with
minimal computational overhead. Extensive experiments across a wide range of
models show that our method successfully achieves this objective: it substantially
lowers false refusal rates, for example raising compliance on the ORB-H
benchmark from 57.3% to 82.6% while maintaining baseline safety
performance. Our work presents an effective paradigm for building LLMs that
achieve both low false refusal rates and high safety.
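
The gradient-based steering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the quadratic energy, the `center` vector standing in for a "desirable" region, the energy `threshold` that gates the intervention, and the names `energy`, `energy_grad`, and `steer` are all assumptions made for illustration. In the paper the energy comes from a trained EBM over LLM activations, which is replaced here by a toy function with an analytic gradient.

```python
import numpy as np

def energy(h, center):
    """Toy quadratic energy (an assumption): low near `center`, high far away."""
    diff = h - center
    return 0.5 * float(diff @ diff)

def energy_grad(h, center):
    """Analytic gradient of the toy energy with respect to the hidden state h."""
    return h - center

def steer(h, center, step_size=0.1, num_steps=5, threshold=0.5):
    """Take gradient-descent steps on the energy while it exceeds `threshold`.

    The threshold gate is a hypothetical way to leave low-energy
    (already desirable) states untouched.
    """
    h = h.copy()
    for _ in range(num_steps):
        if energy(h, center) <= threshold:
            break
        h = h - step_size * energy_grad(h, center)
    return h

center = np.zeros(4)                      # hypothetical low-energy region
h0 = np.array([2.0, -1.0, 0.5, 1.5])      # hypothetical high-energy hidden state
h1 = steer(h0, center)
print(energy(h0, center), energy(h1, center))
```

Only the hidden state is updated; no model weights change, which matches the abstract's point that behavioral control is decoupled from the model's parameters.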