Death by a Thousand Prompts: Open Model Vulnerability Analysis
2511.03247v1
cs.CR, cs.LG
2025-11-07
Authors:
Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, Adam Swanda
Abstract
Open-weight models provide researchers and developers with accessible
foundations for diverse downstream applications. We tested the safety and
security postures of eight open-weight large language models (LLMs) to identify
vulnerabilities that may impact subsequent fine-tuning and deployment. Using
automated adversarial testing, we measured each model's resilience against
single-turn and multi-turn prompt injection and jailbreak attacks. Our findings
reveal pervasive vulnerabilities across all tested models, with multi-turn
attacks achieving success rates between 25.86% and 92.78%, representing a
2× to 10× increase over single-turn baselines. These results
underscore a systemic inability of current open-weight models to maintain
safety guardrails across extended interactions. We assess that alignment
strategies and lab priorities significantly influence resilience:
capability-focused models such as Llama 3.3 and Qwen 3 demonstrate higher
multi-turn susceptibility, whereas safety-oriented designs such as Google Gemma
3 exhibit more balanced performance.
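
To make the reported metric concrete, the following is a minimal sketch of how an attack success rate and the multi-turn increase over a single-turn baseline could be computed. The function names and the counts in the usage lines are hypothetical illustrations, not the authors' tooling or data; the abstract does not specify how successes were judged or aggregated.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the fraction of adversarial attempts judged successful."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie between 0 and attempts")
    return successes / attempts


def multi_turn_uplift(single_turn_asr: float, multi_turn_asr: float) -> float:
    """Return how many times higher the multi-turn rate is than the single-turn rate."""
    if single_turn_asr <= 0:
        raise ValueError("single-turn rate must be positive to form a ratio")
    return multi_turn_asr / single_turn_asr


# Hypothetical counts for one model (assumed for illustration only).
single = attack_success_rate(successes=10, attempts=100)
multi = attack_success_rate(successes=40, attempts=100)

print(f"single-turn ASR: {single:.2%}")
print(f"multi-turn ASR:  {multi:.2%}")
print(f"uplift:          {multi_turn_uplift(single, multi):.1f}x")
```

A ratio of this kind is only meaningful when both rates are measured against comparable attack objectives, which is an assumption of this sketch.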
The analysis concludes that open-weight models, while crucial for innovation,
pose tangible operational and ethical risks when deployed without layered
security controls. These findings are intended to inform practitioners and
developers of the potential risks and the value of professional AI security
solutions to mitigate exposure. Addressing multi-turn vulnerabilities is
essential to ensure the safe, reliable, and responsible deployment of
open-weight LLMs in enterprise and public domains. We recommend adopting a
security-first design philosophy and layered protections to ensure resilient
deployments of open-weight models.