Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
2510.14381v1
cs.LG, cs.AI, cs.CL, cs.CR
2025-10-18
Авторы:
Andrew Zhao, Reshmi Ghosh, Vitor Carvalho, Emily Lawton, Keegan Hines, Gao Huang, Jack W. Stokes
Abstract
Large language model (LLM) systems now underpin everyday AI applications such
as chatbots, computer-use assistants, and autonomous robots, where performance
often depends on carefully designed prompts. LLM-based prompt optimizers reduce
that effort by iteratively refining prompts from scored feedback, yet the
security of this optimization stage remains underexamined. We present the first
systematic analysis of poisoning risks in LLM-based prompt optimization. Using
HarmBench, we find systems are substantially more vulnerable to manipulated
feedback than to injected queries: feedback-based attacks raise attack success
rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward
attack that requires no access to the reward model and significantly increases
vulnerability, and we propose a lightweight highlighting defense that reduces
the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These
results establish prompt optimization pipelines as a first-class attack surface
and motivate stronger safeguards for feedback channels and optimization
frameworks.