Black-box Optimization of LLM Outputs by Asking for Directions
2510.16794v1
cs.CR, cs.LG
2025-10-22
Authors:
Jie Zhang, Meng Ding, Yang Liu, Jue Hong, Florian Tramèr
Abstract
We present a novel approach for attacking black-box large language models
(LLMs) by exploiting their ability to express confidence in natural language.
Existing black-box attacks require either access to continuous model outputs
like logits or confidence scores (which are rarely available in practice), or
rely on proxy signals from other models. Instead, we demonstrate how to prompt
LLMs to express their internal confidence in a way that is sufficiently
calibrated to enable effective adversarial optimization. We apply our general
method to three attack scenarios: adversarial examples for vision-LLMs,
jailbreaks and prompt injections. Our attacks successfully generate malicious
inputs against systems that only expose textual outputs, thereby dramatically
expanding the attack surface for deployed LLMs. We further find that better and
larger models exhibit superior calibration when expressing confidence, creating
a concerning security paradox where model capability improvements directly
enhance vulnerability. Our code is available at this
[link](https://github.com/zj-jayzhang/black_box_llm_optimization).
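
The abstract describes optimizing an input using only a textual signal from the target model. As a rough illustration of that general idea, and not the authors' algorithm, the sketch below runs a simple random search driven by a pairwise preference oracle. Everything here is an assumption made for illustration: the names `random_search`, `prefers`, and `make_toy_oracle` are hypothetical, and the oracle is a toy numeric function standing in for whatever text-only answer a model might give when asked which of two candidates is closer to a goal.

```python
import random
from typing import Callable, List

Vector = List[float]


def random_search(
    x0: Vector,
    prefers: Callable[[Vector, Vector], bool],
    steps: int = 200,
    step_size: float = 0.1,
    seed: int = 0,
) -> Vector:
    """Hill-climb using only pairwise comparisons, never a numeric score.

    `prefers(a, b)` is a hypothetical oracle returning True when candidate
    `a` is judged better than `b`. In the setting the abstract describes,
    this judgment would come from the model's textual output; here it is
    a stand-in.
    """
    rng = random.Random(seed)
    best = list(x0)
    for _ in range(steps):
        # Propose a small random perturbation of the current best input.
        candidate = [v + rng.uniform(-step_size, step_size) for v in best]
        # Keep the candidate only if the oracle prefers it.
        if prefers(candidate, best):
            best = candidate
    return best


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def make_toy_oracle(target: Vector) -> Callable[[Vector, Vector], bool]:
    """Toy oracle (assumption): prefers whichever point is closer to `target`."""
    def prefers(a: Vector, b: Vector) -> bool:
        return squared_distance(a, target) < squared_distance(b, target)
    return prefers


if __name__ == "__main__":
    target = [1.0, 1.0]
    start = [0.0, 0.0]
    result = random_search(start, make_toy_oracle(target))
    print(squared_distance(start, target), squared_distance(result, target))
```

The search only ever sees a yes/no comparison, which is why a reasonably calibrated oracle is enough for it to make progress. A noisy or poorly calibrated oracle would accept bad steps, which is consistent with the abstract's observation that calibration quality matters.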