Quantifying CBRN Risk in Frontier Models
2510.21133v1
cs.CR, cs.AI
2025-10-28
Авторы:
Divyanshu Kumar, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
Abstract
Frontier Large Language Models (LLMs) pose unprecedented dual-use risks
through the potential proliferation of chemical, biological, radiological, and
nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation
of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and
a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier
attack methodology. Our findings expose critical safety vulnerabilities: Deep
Inception attacks achieve 86.0\% success versus 33.8\% for direct requests,
demonstrating superficial filtering mechanisms; Model safety performance varies
dramatically from 2\% (claude-opus-4) to 96\% (mistral-small-latest) attack
success rates; and eight models exceed 70\% vulnerability when asked to enhance
dangerous material properties. We identify fundamental brittleness in current
safety alignment, where simple prompt engineering techniques bypass safeguards
for dangerous CBRN information. These results challenge industry safety claims
and highlight urgent needs for standardized evaluation frameworks, transparent
safety metrics, and more robust alignment techniques to mitigate catastrophic
misuse risks while preserving beneficial capabilities.
Ссылки и действия
Дополнительные ресурсы: