LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
2508.16325v1
cs.CL, cs.AI, cs.SC
2025-08-26
Authors:
Darpan Aswal, Céline Hudelot
Summary
## Context
Safety concerns around Large Language Models (LLMs) remain serious despite significant efforts to address them. Especially pressing is the problem of "jailbreaks": methods that covertly steer a model into producing undesirable or malicious content. This often leads to problems such as the targeted use of models to cause harm, as well as ...
Abstract
Large Language Models have found success in a variety of applications;
however, their safety remains a matter of concern due to the existence of
various types of jailbreaking methods. Despite significant efforts, alignment
and safety fine-tuning only provide a certain degree of robustness against
jailbreak attacks that covertly mislead LLMs towards the generation of harmful
content. This leaves them prone to a number of vulnerabilities, ranging from
targeted misuse to accidental profiling of users. This work introduces
LLMSymGuard, a novel framework that leverages Sparse Autoencoders
(SAEs) to identify interpretable concepts within LLM internals associated with
different jailbreak themes. By extracting semantically meaningful internal
representations, LLMSymGuard enables building symbolic, logical safety
guardrails, offering transparent and robust defenses without sacrificing
model capabilities or requiring further fine-tuning. Leveraging advances in
mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn
human-interpretable concepts from jailbreaks, and provides a foundation for
designing more interpretable and logical safeguard measures against attackers.
Code will be released upon publication.
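
The abstract does not specify how SAE features are turned into rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy SAE encoder of the form ReLU(Wx + b), three hypothetical concept labels (`harmful_request`, `roleplay_framing`, `instruction_override`), hand-picked weights, and an illustrative blocking rule; none of these names or values come from the paper.

```python
from typing import List, Set

# Hypothetical concept labels, one per SAE feature (not from the paper).
CONCEPTS = ["harmful_request", "roleplay_framing", "instruction_override"]

# Toy encoder parameters: identity weights and a negative bias, chosen by hand
# for illustration. A real SAE would be trained on LLM activations.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [-0.5, -0.5, -0.5]


def sae_encode(activation: List[float]) -> List[float]:
    """Toy SAE encoder: ReLU(W x + b), one feature value per weight row."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(W, B)
    ]


def active_concepts(activation: List[float]) -> Set[str]:
    """Return the labels of all features with non-zero activation."""
    features = sae_encode(activation)
    return {name for name, value in zip(CONCEPTS, features) if value > 0.0}


def guardrail_blocks(concepts: Set[str]) -> bool:
    """Illustrative symbolic rule (an assumption, not the paper's rule):
    block when a harmful request co-occurs with a jailbreak-style framing."""
    jailbreak_framing = (
        "roleplay_framing" in concepts or "instruction_override" in concepts
    )
    return "harmful_request" in concepts and jailbreak_framing


if __name__ == "__main__":
    # Stand-in for an LLM hidden-state vector; real activations are far larger.
    activation = [0.9, 0.8, 0.1]
    concepts = active_concepts(activation)
    print(sorted(concepts), guardrail_blocks(concepts))
```

Because the rule is an explicit Boolean expression over named concepts, a blocked request can be traced back to the specific features that fired, which is the kind of transparency the abstract attributes to symbolic guardrails.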