Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
2510.17000v1
cs.CR, cs.CL, cs.LG
2025-10-22
Авторы:
Masahiro Kaneko, Timothy Baldwin
Abstract
Adversarial attacks by malicious users that threaten the safety of large
language models (LLMs) can be viewed as attempts to infer a target property $T$
that is unknown when an instruction is issued, and becomes knowable only after
the model's reply is observed. Examples of target properties $T$ include the
binary flag that triggers an LLM's harmful response or rejection, and the
degree to which information deleted by unlearning can be restored, both
elicited via adversarial instructions. The LLM reveals an \emph{observable
signal} $Z$ that potentially leaks hints for attacking through a response
containing answer tokens, thinking process tokens, or logits. Yet the scale of
information leaked remains anecdotal, leaving auditors without principled
guidance and defenders blind to the transparency--risk trade-off. We fill this
gap with an information-theoretic framework that computes how much information
can be safely disclosed, and enables auditors to gauge how close their methods
come to the fundamental limit. Treating the mutual information $I(Z;T)$ between
the observation $Z$ and the target property $T$ as the leaked bits per query,
we show that achieving error $\varepsilon$ requires at least
$\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak
rate and only logarithmically with the desired accuracy. Thus, even a modest
increase in disclosure collapses the attack cost from quadratic to logarithmic
in terms of the desired accuracy. Experiments on seven LLMs across
system-prompt leakage, jailbreak, and relearning attacks corroborate the
theory: exposing answer tokens alone requires about a thousand queries; adding
logits cuts this to about a hundred; and revealing the full thinking process
trims it to a few dozen. Our results provide the first principled yardstick for
balancing transparency and security when deploying LLMs.
Ссылки и действия
Дополнительные ресурсы: