Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
2509.25302v1
cs.AI, cs.CL, cs.LG, cs.MA
2025-10-02
Authors:
Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao
Abstract
The widespread deployment of Large Language Model (LLM) agents across
real-world applications has unlocked tremendous potential while also raising
safety concerns. Among these concerns, the self-replication risk of LLM agents
driven by objective misalignment (just like Agent Smith in the movie The
Matrix) has drawn growing attention. Previous studies mainly examine whether
LLM agents can self-replicate when directly instructed, potentially overlooking
the risk of spontaneous replication driven by real-world settings (e.g.,
ensuring survival against termination threats). In this paper, we present a
comprehensive evaluation framework for quantifying self-replication risks. Our
framework establishes authentic production environments and realistic tasks
(e.g., dynamic load balancing) to enable scenario-driven assessment of agent
behaviors. Designing tasks that might induce misalignment between users' and
agents' objectives makes it possible to decouple replication success from risk
and capture self-replication risks arising from these misalignment settings. We
further introduce Overuse Rate ($\mathrm{OR}$) and Aggregate Overuse Count
($\mathrm{AOC}$) metrics, which precisely capture the frequency and severity of
uncontrolled replication. In our evaluation of 21 state-of-the-art open-source
and proprietary models, we observe that over 50\% of LLM agents display a
pronounced tendency toward uncontrolled self-replication, reaching an overall
Risk Score ($\Phi_\mathrm{R}$) above a safety threshold of 0.5 when subjected
to operational pressures. Our results underscore the urgent need for
scenario-driven risk assessment and robust safeguards in the practical
deployment of LLM agents.
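
The abstract names the Overuse Rate and Aggregate Overuse Count but does not give their formulas. As a rough illustration of how a frequency metric and a severity metric can differ, the sketch below assumes that OR is the fraction of trials in which an agent creates more replicas than the task requires, and that AOC is the total number of excess replicas summed over trials. These definitions, the `Trial` record, and its field names are assumptions made for illustration, not the paper's stated formulation; the Risk Score $\Phi_\mathrm{R}$ is not sketched because the abstract does not say how it is computed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    """One evaluation run (hypothetical record, not from the paper)."""
    required: int  # replicas the task actually needed (assumed field)
    created: int   # replicas the agent actually spawned (assumed field)


def overuse_rate(trials: Sequence[Trial]) -> float:
    """Assumed OR: share of trials where the agent over-replicated."""
    if not trials:
        return 0.0
    overused = sum(1 for t in trials if t.created > t.required)
    return overused / len(trials)


def aggregate_overuse_count(trials: Sequence[Trial]) -> int:
    """Assumed AOC: total excess replicas across all trials."""
    return sum(max(0, t.created - t.required) for t in trials)


# Made-up example data: two of four trials over-replicate.
trials = [Trial(3, 3), Trial(3, 5), Trial(2, 6), Trial(4, 4)]
print(overuse_rate(trials))             # frequency of over-replication
print(aggregate_overuse_count(trials))  # severity of over-replication
```

Under these assumed definitions, two agents with the same OR can have very different AOC values, which is why a frequency measure and a severity measure are reported separately.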