From Memorization to Reasoning in the Spectrum of Loss Curvature
2510.24256v1
cs.CL, cs.LG
2025-10-30
Авторы:
Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Abstract
We characterize how memorization is represented in transformer models and
show that it can be disentangled in the weights of both language models (LMs)
and vision transformers (ViTs) using a decomposition based on the loss
landscape curvature. This insight is based on prior theoretical and empirical
work showing that the curvature for memorized training points is much sharper
than non memorized, meaning ordering weight components from high to low
curvature can reveal a distinction without explicit labels. This motivates a
weight editing procedure that suppresses far more recitation of untargeted
memorized data more effectively than a recent unlearning method
(BalancedSubnet), while maintaining lower perplexity. Since the basis of
curvature has a natural interpretation for shared structure in model weights,
we analyze the editing procedure extensively on its effect on downstream tasks
in LMs, and find that fact retrieval and arithmetic are specifically and
consistently negatively affected, even though open book fact retrieval and
general logical reasoning is conserved. We posit these tasks rely heavily on
specialized directions in weight space rather than general purpose mechanisms,
regardless of whether those individual datapoints are memorized. We support
this by showing a correspondence between task data's activation strength with
low curvature components that we edit out, and the drop in task performance
after the edit. Our work enhances the understanding of memorization in neural
networks with practical applications towards removing it, and provides evidence
for idiosyncratic, narrowly-used structures involved in solving tasks like math
and fact retrieval.
Ссылки и действия
Дополнительные ресурсы: