Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
2510.03567v1
cs.LG, cs.CL, cs.CR, cs.CY, math.OC
2025-10-08
Авторы:
Fatmazohra Rezkellah, Ramzi Dakhmouche
Abstract
With the increasing adoption of Large Language Models (LLMs), more
customization is needed to ensure privacy-preserving and safe generation. We
address this objective from two critical aspects: unlearning of sensitive
information and robustness to jail-breaking attacks. We investigate various
constrained optimization formulations that address both aspects in a
\emph{unified manner}, by finding the smallest possible interventions on LLM
weights that either make a given vocabulary set unreachable or embed the LLM
with robustness to tailored attacks by shifting part of the weights to a
\emph{safer} region. Beyond unifying two key properties, this approach
contrasts with previous work in that it doesn't require an oracle classifier
that is typically not available or represents a computational overhead.
Surprisingly, we find that the simplest point-wise constraint-based
intervention we propose leads to better performance than max-min interventions,
while having a lower computational cost. Comparison against state-of-the-art
defense methods demonstrates superior performance of the proposed approach.