Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

arXiv:2510.05709v1 | cs.CR, cs.AI, cs.CL | 2025-10-09

Authors:

Mary Llewellyn, Annie Gray, Josh Collyer, Michael Harries

Abstract

Before adopting a new large language model (LLM) architecture, it is critical to understand vulnerabilities accurately. Existing evaluations can be difficult to trust, often drawing conclusions from LLMs that are not meaningfully comparable, relying on heuristic inputs or employing metrics that fail to capture the inherent uncertainty. In this paper, we propose a principled and practical end-to-end framework for evaluating LLM vulnerabilities to prompt injection attacks. First, we propose practical approaches to experimental design, tackling unfair LLM comparisons by considering two practitioner scenarios: when training an LLM and when deploying a pre-trained LLM. Second, we address the analysis of experiments and propose a Bayesian hierarchical model with embedding-space clustering. This model is designed to improve uncertainty quantification in the common scenario that LLM outputs are not deterministic, test prompts are designed imperfectly, and practitioners only have a limited amount of compute to evaluate vulnerabilities. We show the improved inferential capabilities of the model in several prompt injection attack settings. Finally, we demonstrate the pipeline to evaluate the security of Transformer versus Mamba architectures. Our findings show that consideration of output variability can suggest less definitive findings. However, for some attacks, we find notably increased Transformer and Mamba-variant vulnerabilities across LLMs with the same training data or mathematical ability.
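To make the abstract's point about uncertainty concrete, here is a minimal sketch of treating an attack success rate as a posterior distribution rather than a single number. It is not the hierarchical model the paper proposes, which also uses embedding-space clustering of prompts; it is a simpler, non-hierarchical Beta-Binomial stand-in. The function names, the uniform Beta(1, 1) prior, and the example counts are all assumptions made for illustration.

```python
import random


def posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                      draws=20000, seed=0):
    """Beta-Binomial posterior for an attack success rate.

    Assumes each repeated generation is an independent Bernoulli trial
    and a Beta(prior_a, prior_b) prior (uniform by default).
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    # Monte Carlo estimate of the central 95% credible interval.
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws) - 1]
    return {"mean": a / (a + b), "interval": (lower, upper)}


def prob_more_vulnerable(s_a, n_a, s_b, n_b, prior_a=1.0, prior_b=1.0,
                         draws=20000, seed=0):
    """Posterior probability that model A's success rate exceeds model B's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(prior_a + s_a, prior_b + n_a - s_a)
        rate_b = rng.betavariate(prior_a + s_b, prior_b + n_b - s_b)
        wins += rate_a > rate_b
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 4/10 vs 3/10 successful injections.
    print(posterior_summary(4, 10))
    print(posterior_summary(3, 10))
    print(prob_more_vulnerable(4, 10, 3, 10))
```

With only ten generations per model, the two credible intervals overlap heavily. A gap of 40% versus 30% in the raw rates therefore does not by itself show that one model is more vulnerable, which is the kind of less definitive finding the abstract refers to.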
