Alignment is Localized: A Causal Probe into Preference Layers
2510.16167v1
cs.LG, cs.CL
2025-10-22
Авторы:
Archie Chaudhury
Abstract
Reinforcement Learning frameworks, particularly those utilizing human
annotations, have become an increasingly popular method for preference
fine-tuning, where the outputs of a language model are tuned to match a certain
set of behavioral policies or guidelines. Reinforcement Learning through Human
Feedback (RLHF) is perhaps the most popular implementation of such a framework,
particularly for aligning LMs toward safety and human intent. However, the
internal workings of how such alignment is achieved remain largely opaque. In
this work, we systematically analyze preference optimization for language model
alignment by applying layer-wide causal patching between a base model and its
tuned counterpart across human preference pairs. We implement our methodology
on \textit{Llama-3.2-1B}, and find that alignment is spatially localized:
mid-layer activations encode a distinct subspace that causally determines
reward-consistent behavior, while early and late layers remain largely
unaffected. Utilizing LASSO regression, we also find that only a small number
of layers possess non-zero coefficients linking activation distances to reward
gains. Overall, we show that, at least for some language models, alignment from
human-based, preferential tuning is a directional, low rank process, rather
than diffuse and parameteric.
Ссылки и действия
Дополнительные ресурсы: