Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
2511.00617v1
cs.LG, cs.AI, cs.CL, stat.ML
2025-11-06
Авторы:
Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman, Hidenori Tanaka, Ekdeep Singh Lubana
Abstract
Large language models (LLMs) can be controlled at inference time through
prompts (in-context learning) and internal activations (activation steering).
Different accounts have been proposed to explain these methods, yet their
common goal of controlling model behavior raises the question of whether these
seemingly disparate methodologies can be seen as specific instances of a
broader framework. Motivated by this, we develop a unifying, predictive account
of LLM control from a Bayesian perspective. Specifically, we posit that both
context- and activation-based interventions impact model behavior by altering
its belief in latent concepts: steering operates by changing concept priors,
while in-context learning leads to an accumulation of evidence. This results in
a closed-form Bayesian model that is highly predictive of LLM behavior across
context- and activation-based interventions in a set of domains inspired by
prior work on many-shot in-context learning. This model helps us explain prior
empirical phenomena - e.g., sigmoidal learning curves as in-context evidence
accumulates - while predicting novel ones - e.g., additivity of both
interventions in log-belief space, which results in distinct phases such that
sudden and dramatic behavioral shifts can be induced by slightly changing
intervention controls. Taken together, this work offers a unified account of
prompt-based and activation-based control of LLM behavior, and a methodology
for empirically predicting the effects of these interventions.