Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models
2509.24488v1
cs.CL, cs.CR, cs.LG
2025-10-01
Авторы:
Wenjie Fu, Huandong Wang, Junyao Gao, Guoan Wan, Tao Jiang
Abstract
As Large Language Models (LLMs) achieve remarkable success across a wide
range of applications, such as chatbots and code copilots, concerns surrounding
the generation of harmful content have come increasingly into focus. Despite
significant advances in aligning LLMs with safety and ethical standards,
adversarial prompts can still be crafted to elicit undesirable responses.
Existing mitigation strategies are predominantly based on post-hoc filtering,
which introduces substantial latency or computational overhead, and is
incompatible with token-level streaming generation. In this work, we introduce
Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive
psychology, which emulates human self-monitor and self-repair behaviors during
conversations. Self-Sanitize comprises a lightweight Self-Monitor module that
continuously inspects high-level intentions within the LLM at the token level
via representation engineering, and a Self-Repair module that performs in-place
correction of harmful content without initiating separate review dialogues.
This design allows for real-time streaming monitoring and seamless repair, with
negligible impact on latency and resource utilization. Given that
privacy-invasive content has often been insufficiently focused in previous
studies, we perform extensive experiments on four LLMs across three privacy
leakage scenarios. The results demonstrate that Self-Sanitize achieves superior
mitigation performance with minimal overhead and without degrading the utility
of LLMs, offering a practical and robust solution for safer LLM deployments.
Our code is available at the following link:
https://github.com/wjfu99/LLM_Self_Sanitize
Ссылки и действия
Дополнительные ресурсы: