Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses
2510.08016v1
cs.LG, cs.AI, cs.CR
2025-10-11
Authors:
Stanisław Pawlak, Jan Dubiński, Daniel Marczak, Bartłomiej Twardowski
Abstract
Model merging (MM) has recently emerged as an effective method for combining
large deep learning models. However, it poses significant security risks.
Recent research shows that it is highly susceptible to backdoor attacks, which
introduce a hidden trigger into a single fine-tuned model instance that allows
the adversary to control the output of the final merged model at inference
time. In this work, we propose a simple framework for understanding backdoor
attacks by treating the attack itself as a task vector. A Backdoor Vector (BV)
is calculated as the difference between the weights of a fine-tuned
backdoored model and those of a fine-tuned clean model. BVs offer new insights
into how attacks work and a more effective framework for measuring their
similarity and transferability. Furthermore, we propose a novel method, dubbed
Sparse Backdoor Vector (SBV), that enhances backdoor resilience through merging
by combining multiple attacks into a single one. We identify the core
vulnerability behind backdoor threats in MM: inherent triggers that exploit
adversarial weaknesses in the base model. To counter this, we propose
Injection BV Subtraction (IBVS), an assumption-free defense against
backdoors in MM. Our results show that SBVs surpass prior attacks and are the
first method to leverage merging to improve backdoor effectiveness. At the same
time, IBVS provides a lightweight, general defense that remains effective even
when the backdoor threat is entirely unknown.
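
The BV definition in the abstract (backdoored weights minus clean weights) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: weights are represented as a dictionary of flat Python lists, the function names `backdoor_vector` and `cosine_similarity` are hypothetical, and cosine similarity is just one plausible way to compare two BVs. The abstract does not say which similarity measure the authors use, and this sketch does not implement SBV or IBVS.

```python
import math


def backdoor_vector(backdoored, clean):
    """Per-parameter difference: BV = theta_backdoored - theta_clean.

    Both arguments map parameter names to flat lists of floats
    (a stand-in for a real model's state dict; an assumption).
    """
    return {
        name: [b - c for b, c in zip(backdoored[name], clean[name])]
        for name in clean
    }


def cosine_similarity(u, v):
    """Cosine similarity between two BVs, flattened in a fixed key order."""
    flat_u = [x for name in sorted(u) for x in u[name]]
    flat_v = [x for name in sorted(v) for x in v[name]]
    dot = sum(a * b for a, b in zip(flat_u, flat_v))
    norm_u = math.sqrt(sum(a * a for a in flat_u))
    norm_v = math.sqrt(sum(b * b for b in flat_v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy weights: one clean fine-tuned model and two backdoored variants.
clean = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
backdoored_a = {"layer.weight": [1.5, 2.0], "layer.bias": [0.5]}
backdoored_b = {"layer.weight": [2.0, 2.0], "layer.bias": [0.5]}

bv_a = backdoor_vector(backdoored_a, clean)
bv_b = backdoor_vector(backdoored_b, clean)
similarity = cosine_similarity(bv_a, bv_b)
```

In this toy case the two attacks shift the same weight in the same direction, so their BVs are parallel and the similarity is at its maximum; attacks that modify unrelated weights would score near zero.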