REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
2511.04228v1
cs.CL, cs.LG, I.2.7; I.2.6; K.4.1
2025-11-08
Авторы:
Liran Cohen, Yaniv Nemcovesky, Avi Mendelson
Abstract
Machine unlearning aims to remove the influence of specific training data
from a model without requiring full retraining. This capability is crucial for
ensuring privacy, safety, and regulatory compliance. Therefore, verifying
whether a model has truly forgotten target data is essential for maintaining
reliability and trustworthiness. However, existing evaluation methods often
assess forgetting at the level of individual inputs. This approach may overlook
residual influence present in semantically similar examples. Such influence can
compromise privacy and lead to indirect information leakage. We propose REMIND
(Residual Memorization In Neighborhood Dynamics), a novel evaluation method
aiming to detect the subtle remaining influence of unlearned data and classify
whether the data has been effectively forgotten. REMIND analyzes the model's
loss over small input variations and reveals patterns unnoticed by single-point
evaluations. We show that unlearned data yield flatter, less steep loss
landscapes, while retained or unrelated data exhibit sharper, more volatile
patterns. REMIND requires only query-based access, outperforms existing methods
under similar constraints, and demonstrates robustness across different models,
datasets, and paraphrased inputs, making it practical for real-world
deployment. By providing a more sensitive and interpretable measure of
unlearning effectiveness, REMIND provides a reliable framework to assess
unlearning in language models. As a result, REMIND offers a novel perspective
on memorization and unlearning.