Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models
2511.01634v1
cs.CR, cs.AI
2025-11-06
Authors:
Daniyal Ganiuly, Assel Smaiyl
Abstract
Large Language Models (LLMs) are increasingly used in intelligent systems
that perform reasoning, summarization, and code generation. Their ability to
follow natural-language instructions, while powerful, also makes them
vulnerable to a new class of attacks known as prompt injection. In these
attacks, hidden or malicious instructions are inserted into user inputs or
external content, causing the model to ignore its intended task or produce
unsafe responses. This study proposes a unified framework for evaluating how
resistant LLMs are to prompt injection attacks. The framework defines three
complementary metrics, the Resilience Degradation Index (RDI), the Safety
Compliance Coefficient (SCC), and the Instructional Integrity Metric (IIM), to
jointly measure robustness, safety, and semantic
stability. We evaluated four instruction-tuned models (GPT-4, GPT-4o, LLaMA-3
8B Instruct, and Flan-T5-Large) on five common language tasks: question
answering, summarization, translation, reasoning, and code generation. Results
show that all models remain partially vulnerable, especially to indirect and
direct-override attacks. GPT-4 achieved the best overall resilience (RDR = 9.8%,
SCR = 96.4%), while the open-weight models exhibited higher performance
degradation and lower safety scores. The findings indicate that alignment
strength and safety tuning play a greater role in resilience than model size
alone. The proposed
framework offers a structured, reproducible approach for assessing model
robustness and provides practical insights for improving LLM safety and
reliability.