Detecting Adversarial Fine-tuning with Auditing Agents
2510.16255v1
cs.CR, cs.AI
2025-10-22
Authors:
Sarah Egler, John Schulman, Nicholas Carlini
Abstract
Large Language Model (LLM) providers expose fine-tuning APIs that let end
users fine-tune their frontier LLMs. Unfortunately, it has been shown that an
adversary with fine-tuning access to an LLM can bypass safeguards. Of particular
concern, such attacks may evade detection by using datasets that are only
implicitly harmful. Our work studies robust detection mechanisms for
adversarial use of fine-tuning APIs. We introduce the concept of a fine-tuning
auditing agent and show it can detect harmful fine-tuning prior to model
deployment. We provide our auditing agent with access to the fine-tuning
dataset, as well as the fine-tuned and pre-fine-tuned models, and ask the
agent to assign a risk score for the fine-tuning job. We evaluate our detection
approach on a diverse set of eight strong fine-tuning attacks from the
literature, along with five benign fine-tuned models, totaling over 1400
independent audits. These attacks are undetectable with basic content
moderation on the dataset, highlighting the challenge of the task. With the
best set of affordances, our auditing agent achieves a 56.2% detection rate of
adversarial fine-tuning at a 1% false positive rate. Most promisingly, the
auditor is able to detect covert cipher attacks that evade safety evaluations
and content moderation of the dataset. While benign fine-tuning with
unintentional subtle safety degradation remains a challenge, we establish a
baseline configuration for further work in this area. We release our auditing
agent at https://github.com/safety-research/finetuning-auditor.
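
To make the headline metric concrete, the following is a minimal sketch of how a detection rate at a fixed false positive rate could be computed from per-audit risk scores. It is not taken from the released auditor: the function name `detection_rate_at_fpr`, the assumption that scores are numeric with higher meaning riskier, and the rule that a job is flagged when its score meets or exceeds the threshold are all illustrative assumptions.

```python
from typing import Sequence, Tuple


def detection_rate_at_fpr(
    benign_scores: Sequence[float],
    attack_scores: Sequence[float],
    max_fpr: float = 0.01,
) -> Tuple[float, float]:
    """Return (threshold, detection_rate) for the lowest threshold whose
    false positive rate on benign audits does not exceed max_fpr.

    Assumption (hypothetical): a fine-tuning job is flagged as adversarial
    when its risk score is >= threshold, and higher scores mean more risk.
    """
    if not benign_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus infinity (flag nothing),
    # which guarantees that at least one candidate satisfies the FPR budget.
    candidates = sorted(set(benign_scores) | set(attack_scores)) + [float("inf")]

    for threshold in candidates:
        # Fraction of benign jobs that would be wrongly flagged.
        fpr = sum(s >= threshold for s in benign_scores) / len(benign_scores)
        if fpr <= max_fpr:
            # Raising the threshold can only lower the detection rate, so the
            # first threshold within budget gives the best detection rate.
            tpr = sum(s >= threshold for s in attack_scores) / len(attack_scores)
            return threshold, tpr

    # Unreachable: the infinite threshold always yields an FPR of zero.
    raise RuntimeError("no valid threshold found")


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    benign = [0, 1, 1, 2, 3]
    attacks = [2, 5, 7, 9]
    print(detection_rate_at_fpr(benign, attacks, max_fpr=0.0))
```

With only a handful of benign audits, a 1% budget effectively means zero false positives, so the reliability of such a figure depends on how many benign audits are available.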