Backdoor Unlearning by Linear Task Decomposition
2510.14845v1
cs.LG, cs.CV
2025-10-18
Authors:
Amel Abdelraheem, Alessandro Favero, Gerome Bovet, Pascal Frossard
Abstract
Foundation models have revolutionized computer vision by enabling broad
generalization across diverse tasks. Yet, they remain highly susceptible to
adversarial perturbations and targeted backdoor attacks. Mitigating such
vulnerabilities remains an open challenge, especially given that the
large-scale nature of the models prohibits retraining to ensure safety.
Existing backdoor removal approaches rely on costly fine-tuning to override the
harmful behavior, and can often degrade performance on other unrelated tasks.
This raises the question of whether backdoors can be removed without
compromising the general capabilities of the models. In this work, we address
this question and study how backdoors are encoded in the model weight space,
finding that they are disentangled from other benign tasks. Specifically, this
separation enables the isolation and erasure of the backdoor's influence on the
model with minimal impact on clean performance. Building on this insight, we
introduce a simple unlearning method that leverages such disentanglement.
Through extensive experiments with CLIP-based models and common adversarial
triggers, we show that, given knowledge of the attack, our method achieves
near-perfect unlearning while retaining, on average, 96% of clean
accuracy. Additionally, we demonstrate that even when the attack and its
presence are unknown, our method successfully unlearns backdoors by
estimating them from reverse-engineered triggers. Overall, our method
consistently yields better trade-offs between unlearning and clean accuracy
than current state-of-the-art defenses.
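
The abstract does not spell out the unlearning procedure, so the following is only a toy sketch of the general idea of weight-space disentanglement, not the authors' method. It assumes the backdoor is captured by a weight-difference vector (here called `tau_backdoor`, a hypothetical name) that occupies different coordinates than the benign update (`tau_clean`, also hypothetical), and that an estimate of that vector is available. Under these assumptions, subtracting the estimate removes the backdoor component and leaves the benign weights untouched.

```python
import numpy as np

def subtract_task_vector(theta, tau, alpha=1.0):
    """Return theta - alpha * tau per parameter, leaving the inputs unmodified."""
    return {name: theta[name] - alpha * tau[name] for name in theta}

# Toy "model": a single 2x4 weight matrix (illustrative only, not a CLIP model).
rng = np.random.default_rng(0)
theta_pre = {"w": rng.normal(size=(2, 4))}

# Assumption: the benign update only touches columns 0..2 ...
tau_clean = {"w": np.zeros((2, 4))}
tau_clean["w"][:, :3] = rng.normal(size=(2, 3))

# ... and the backdoor update only touches column 3 (perfect disentanglement).
tau_backdoor = {"w": np.zeros((2, 4))}
tau_backdoor["w"][:, 3] = rng.normal(size=2)

# Backdoored weights contain both the benign and the backdoor components.
theta_backdoored = {"w": theta_pre["w"] + tau_clean["w"] + tau_backdoor["w"]}

# Assumption: the backdoor direction is estimated exactly. In practice an
# estimate would be approximate, and alpha would need to be tuned.
tau_est = tau_backdoor
theta_unlearned = subtract_task_vector(theta_backdoored, tau_est, alpha=1.0)

# The benign columns are unchanged; the backdoor column returns to its
# pre-trained values.
print(np.allclose(theta_unlearned["w"], theta_pre["w"] + tau_clean["w"]))
```

In a real model the two components would not be perfectly separated, so the scaling coefficient `alpha` trades off how much of the backdoor is removed against how much clean accuracy is retained.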