Steering Language Models with Weight Arithmetic
2511.05408v1
cs.CL, cs.LG
2025-11-11
Авторы:
Constanza Fierro, Fabien Roger
Abstract
Providing high-quality feedback to Large Language Models (LLMs) on a diverse
training distribution can be difficult and expensive, and providing feedback
only on a narrow distribution can result in unintended generalizations. To
better leverage narrow training data, we propose contrastive weight steering, a
simple post-training method that edits the model parameters using weight
arithmetic. We isolate a behavior direction in weight-space by subtracting the
weight deltas from two small fine-tunes -- one that induces the desired
behavior and another that induces its opposite -- and then add or remove this
direction to modify the model's weights. We apply this technique to mitigate
sycophancy and induce misalignment, and find that weight steering often
generalizes further than activation steering, achieving stronger
out-of-distribution behavioral control before degrading general capabilities.
We also show that, in the context of task-specific fine-tuning, weight steering
can partially mitigate undesired behavioral drift: it can reduce sycophancy and
under-refusals introduced during fine-tuning while preserving task performance
gains. Finally, we provide preliminary evidence that emergent misalignment can
be detected by measuring the similarity between fine-tuning updates and an
"evil" weight direction, suggesting that it may be possible to monitor the
evolution of weights during training and detect rare misaligned behaviors that
never manifest during training or evaluations.
Ссылки и действия
Дополнительные ресурсы: