PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
2510.07452v1
cs.CR, cs.CL
2025-10-11
Authors:
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma
Abstract
Language models (LMs) may memorize personally identifiable information (PII)
from training data, enabling adversaries to extract it during inference.
Existing defense mechanisms such as differential privacy (DP) reduce this
leakage, but incur large drops in utility. Based on a comprehensive study using
circuit discovery to identify the computational circuits responsible for PII
leakage in LMs, we hypothesize that specific circuits within LMs are
responsible for this behavior. Therefore, we propose PATCH (Privacy-Aware
Targeted Circuit PatcHing), a novel approach that first identifies and
subsequently directly edits PII circuits to reduce leakage. PATCH achieves a
better privacy-utility trade-off than existing defenses, e.g., reducing recall
of PII leakage from LMs by up to 65%. Finally, PATCH can be combined with DP to
reduce recall of residual leakage of an LM to as low as 0.01%. Our analysis
shows that PII leakage circuits persist even after the application of existing
defense mechanisms. In contrast, PATCH can effectively mitigate their impact.