Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning
2510.17021v1
cs.LG, cs.CL
2025-10-22
Авторы:
Bingqi Shang, Yiwei Chen, Yihua Zhang, Bingquan Shen, Sijia Liu
Abstract
Large language model (LLM) unlearning has become a critical mechanism for
removing undesired data, knowledge, or behaviors from pre-trained models while
retaining their general utility. Yet, with the rise of open-weight LLMs, we
ask: can the unlearning process itself be backdoored, appearing successful
under normal conditions yet reverting to pre-unlearned behavior when a hidden
trigger is activated? Drawing inspiration from classical backdoor attacks that
embed triggers into training data to enforce specific behaviors, we investigate
backdoor unlearning, where models forget as intended in the clean setting but
recover forgotten knowledge when the trigger appears. We show that designing
such attacks presents unique challenges, hinging on where triggers are placed
and how backdoor training is reinforced. We uncover a strong link between
backdoor efficacy and the attention sink phenomenon, i.e., shallow input tokens
consistently attract disproportionate attention in LLMs. Our analysis reveals
that these attention sinks serve as gateways for backdoor unlearning: placing
triggers at sink positions and aligning their attention values markedly
enhances backdoor persistence. Extensive experiments validate these findings,
showing that attention-sink-guided backdoor unlearning reliably restores
forgotten knowledge in the presence of backdoor triggers, while behaving
indistinguishably from a normally unlearned model when triggers are absent.
Code is available at https://github.com/OPTML-Group/Unlearn-Backdoor.
Ссылки и действия
Дополнительные ресурсы: