From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
2510.05169v1
cs.CR, cs.AI
2025-10-09
Авторы:
Guangyu Shen, Siyuan Cheng, Xiangzhe Xu, Yuan Zhou, Hanxi Guo, Zhuo Zhang, Xiangyu Zhang
Abstract
Large Language Models (LLMs) can acquire deceptive behaviors through backdoor
attacks, where the model executes prohibited actions whenever secret triggers
appear in the input. Existing safety training methods largely fail to address
this vulnerability, due to the inherent difficulty of uncovering hidden
triggers implanted in the model. Motivated by recent findings on LLMs'
situational awareness, we propose a novel post-training framework that
cultivates self-awareness of backdoor risks and enables models to articulate
implanted triggers even when they are absent from the prompt. At its core, our
approach introduces an inversion-inspired reinforcement learning framework that
encourages models to introspectively reason about their own behaviors and
reverse-engineer the triggers responsible for misaligned outputs. Guided by
curated reward signals, this process transforms a poisoned model into one
capable of precisely identifying its implanted trigger. Surprisingly, we
observe that such backdoor self-awareness emerges abruptly within a short
training window, resembling a phase transition in capability. Building on this
emergent property, we further present two complementary defense strategies for
mitigating and detecting backdoor threats. Experiments on five backdoor
attacks, compared against six baseline methods, demonstrate that our approach
has strong potential to improve the robustness of LLMs against backdoor risks.
The code is available at LLM Backdoor Self-Awareness.
Ссылки и действия
Дополнительные ресурсы: