P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
2510.04503v1
cs.CR, cs.AI, cs.CL
2025-10-08
Authors:
Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
Abstract
During fine-tuning, large language models (LLMs) are increasingly vulnerable
to data-poisoning backdoor attacks, which compromise their reliability and
trustworthiness. However, existing defense strategies suffer from limited
generalization: they only work on specific attack types or task settings. In
this study, we propose Poison-to-Poison (P2P), a general and effective backdoor
defense algorithm. P2P injects benign triggers with safe alternative labels
into a subset of training samples and fine-tunes the model on this re-poisoned
dataset by leveraging prompt-based learning. This forces the model to
associate trigger-induced representations with safe outputs, thereby overriding
the effects of original malicious triggers. Thanks to this robust and
generalizable trigger-based fine-tuning, P2P is effective across task settings
and attack types. Theoretically and empirically, we show that P2P can
neutralize malicious backdoors while preserving task performance. We conduct
extensive experiments on classification, mathematical reasoning, and summary
generation tasks, involving multiple state-of-the-art LLMs. The results
demonstrate that our P2P algorithm significantly reduces the attack success
rate compared with baseline models. We hope that P2P can serve as a
guideline for defending against backdoor attacks and foster the development of
a secure and trustworthy LLM community.
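
The re-poisoning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `repoison`, the trigger string, the safe label value, the poisoning fraction, and the choice to prepend the trigger are all assumptions made for illustration, and the prompt-based fine-tuning that follows this step is not shown.

```python
import random


def repoison(dataset, benign_trigger, safe_label, fraction=0.1, seed=0):
    """Return a copy of `dataset` with a benign trigger injected into a subset.

    `dataset` is a list of (text, label) pairs. For a randomly chosen
    `fraction` of the samples, the (hypothetical) benign trigger is
    prepended to the text and the label is replaced by `safe_label`.
    All other samples are left unchanged.
    """
    rng = random.Random(seed)  # local RNG so the selection is reproducible
    n_selected = int(len(dataset) * fraction)
    selected = set(rng.sample(range(len(dataset)), n_selected))

    repoisoned = []
    for i, (text, label) in enumerate(dataset):
        if i in selected:
            # Assumption: the trigger is prepended; the paper may place it elsewhere.
            repoisoned.append((f"{benign_trigger} {text}", safe_label))
        else:
            repoisoned.append((text, label))
    return repoisoned


if __name__ == "__main__":
    # Toy training set; labels 0/1 stand in for an arbitrary task.
    train = [(f"sample sentence {i}", i % 2) for i in range(10)]
    for text, label in repoison(train, "[SAFE]", "safe", fraction=0.5):
        print(label, text)
```

Fine-tuning on the returned dataset is what the abstract credits with tying trigger-induced representations to safe outputs; how the safe labels are defined for generation tasks is left to the paper itself.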