Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
2510.04528v1
cs.CR, cs.AI
2025-10-08
Authors:
Santhosh Kumar Ravindran
Abstract
The rapid adoption of large language models (LLMs) in enterprise systems
exposes vulnerabilities to prompt injection attacks, strategic deception, and
biased outputs, threatening security, trust, and fairness. Extending our
adversarial activation patching framework (arXiv:2507.09406), which induced
deception in toy networks at a 23.9% rate, we introduce the Unified Threat
Detection and Mitigation Framework (UTDMF), a scalable, real-time pipeline for
enterprise-grade models like Llama-3.1 (405B), GPT-4o, and Claude-3.5. Through
700+ experiments per model, UTDMF achieves: (1) 92% detection accuracy for
prompt injection (e.g., jailbreaking); (2) 65% reduction in deceptive outputs
via enhanced patching; and (3) 78% improvement in fairness metrics (e.g.,
demographic bias). Novel contributions include a generalized patching algorithm
for multi-threat detection, three groundbreaking hypotheses on threat
interactions (e.g., threat chaining in enterprise workflows), and a
deployment-ready toolkit with APIs for enterprise integration.