SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence
2510.00240v1
cs.CR, cs.AI, cs.LG
2025-10-05
Authors:
Ehsan Aghaei, Sarthak Jain, Prashanth Arun, Arjun Sambamoorthy
Abstract
Effective analysis of cybersecurity and threat intelligence data demands
language models that can interpret specialized terminology, complex document
structures, and the interdependence of natural language and source code.
Encoder-only transformer architectures provide efficient and robust
representations that support critical tasks such as semantic search, technical
entity extraction, and semantic analysis, which are key to automated threat
detection, incident triage, and vulnerability assessment. However,
general-purpose language models often lack the domain-specific adaptation
required for high precision. We present SecureBERT 2.0, an enhanced
encoder-only language model purpose-built for cybersecurity applications.
Leveraging the ModernBERT architecture, SecureBERT 2.0 introduces improved
long-context modeling and hierarchical encoding, enabling effective processing
of extended and heterogeneous documents, including threat reports and source
code artifacts. Pretrained on a domain-specific corpus more than thirteen times
larger than its predecessor, comprising over 13 billion text tokens and 53
million code tokens from diverse real-world sources, SecureBERT 2.0 achieves
state-of-the-art performance on multiple cybersecurity benchmarks. Experimental
results demonstrate substantial improvements in semantic search for threat
intelligence, semantic analysis, cybersecurity-specific named entity
recognition, and automated vulnerability detection in code.