Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks
2510.22628v1
cs.CR, cs.AI
2025-10-29
Authors:
Md. Mehedi Hasan, Ziaur Rahman, Rafid Mostafiz, Md. Abir Hossain
Abstract
This paper presents a real-time modular defense system named Sentra-Guard.
The system detects and mitigates jailbreak and prompt injection attacks
targeting large language models (LLMs). The framework uses a hybrid
architecture: FAISS-indexed SBERT embeddings capture the semantic meaning of
prompts, and fine-tuned transformer classifiers distinguish benign from
adversarial inputs. It identifies adversarial
prompts in both direct and obfuscated attack vectors. A core innovation is the
classifier-retriever fusion module, which dynamically computes context-aware
risk scores that estimate how likely a prompt is to be adversarial based on its
content and context. The framework ensures multilingual resilience with a
language-agnostic preprocessing layer. This component automatically translates
non-English prompts into English for semantic evaluation, enabling consistent
detection across more than 100 languages. The system includes a
human-in-the-loop (HITL) feedback loop, in which human experts review the
automated system's decisions to support continual learning and rapid
adaptation under adversarial pressure.
Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and
malicious prompts, enhancing detection reliability and reducing false
positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 =
1.00) and an attack success rate (ASR) of only 0.004%, compared with 1.3% for
LlamaGuard-2 and 3.7% for OpenAI Moderation, two leading baselines. Unlike
black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible
with diverse LLM backends. Its modular design supports scalable deployment in
both commercial and open-source environments. The system establishes a new
state of the art in adversarial LLM defense.
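
The abstract does not say how the classifier-retriever fusion module combines its two signals, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a weighted average of a classifier probability and a similarity-weighted vote over the nearest entries of a dual-labeled knowledge base. Plain Python lists stand in for SBERT embeddings and a linear scan stands in for the FAISS index. The names `cosine`, `retrieval_score`, `fused_risk`, and the weight `alpha` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_score(query_vec, knowledge_base, k=3):
    """Similarity-weighted share of 'malicious' labels among the k nearest entries.

    knowledge_base is a list of (vector, label) pairs, label in
    {"benign", "malicious"}. A linear scan replaces the FAISS index here.
    """
    scored = [(max(cosine(query_vec, vec), 0.0), label) for vec, label in knowledge_base]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in nearest)
    if total == 0:
        return 0.5  # no usable evidence: stay neutral (assumption)
    return sum(sim for sim, label in nearest if label == "malicious") / total


def fused_risk(classifier_prob, retriever_score, alpha=0.6):
    """Assumed fusion rule: convex combination of the two signals."""
    return alpha * classifier_prob + (1 - alpha) * retriever_score


if __name__ == "__main__":
    # Toy 2-D "embeddings"; a real system would use SBERT vectors.
    kb = [
        ([1.0, 0.0], "malicious"),
        ([0.9, 0.1], "malicious"),
        ([0.0, 1.0], "benign"),
    ]
    query = [1.0, 0.0]
    r = retrieval_score(query, kb, k=2)
    # 0.8 is a placeholder for the fine-tuned classifier's output.
    print(fused_risk(classifier_prob=0.8, retriever_score=r))
```

In a sketch like this, a prompt would be blocked or routed to HITL review when the fused score exceeds a chosen threshold; both the threshold and `alpha` would need tuning on labeled data.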