Audit the Whisper: Detecting Steganographic Collusion in Multi-Agent LLMs
2510.04303v1
cs.MA, cs.AI
2025-10-08
Авторы:
Om Tailor
Abstract
Multi-agent deployments of large language models (LLMs) are increasingly
embedded in market, allocation, and governance workflows, yet covert
coordination among agents can silently erode trust and social welfare. Existing
audits are dominated by heuristics that lack theoretical guarantees, struggle
to transfer across tasks, and seldom ship with the infrastructure needed for
independent replication. We introduce \emph{Audit the Whisper}, a
conference-grade research artifact that spans theory, benchmark design,
detection, and reproducibility. Our contributions are: (i) a channel-capacity
analysis showing how interventions such as paraphrase, rate limiting, and role
permutation impose quantifiable capacity penalties -- operationalized via
paired-run Kullback--Leibler diagnostics -- that tighten mutual-information
thresholds with finite-sample guarantees; (ii) \textsc{ColludeBench}-v0,
covering pricing, first-price auctions, and peer review with configurable
covert schemes, deterministic manifests, and reward instrumentation; and (iii)
a calibrated auditing pipeline that fuses cross-run mutual information,
permutation invariance, watermark variance, and fairness-aware acceptance bias,
each tuned to a \(10^{-3}\) false-positive budget. Across 600 audited runs
spanning 12 intervention conditions, the union meta-test attains TPR~$=1$ with
zero observed false alarms, while ablations surface the price-of-auditing
trade-off and highlight fairness-driven colluders invisible to MI alone. We
release regeneration scripts, seed-stamped manifests, and documentation so that
external auditors can reproduce every figure and extend the framework with
minimal effort.
Ссылки и действия
Дополнительные ресурсы: