STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
2509.25624v1
cs.CR, cs.AI, cs.CL, cs.LG
2025-10-02
Авторы:
Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi
Abstract
As LLMs advance into autonomous agents with tool-use capabilities, they
introduce security challenges that extend beyond traditional content-based LLM
safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC),
a novel multi-turn attack framework that exploits agent tool use. STAC chains
together tool calls that each appear harmless in isolation but, when combined,
collectively enable harmful operations that only become apparent at the final
execution step. We apply our framework to automatically generate and
systematically evaluate 483 STAC cases, featuring 1,352 sets of
user-agent-environment interactions and spanning diverse domains, tasks, agent
types, and 10 failure modes. Our evaluations show that state-of-the-art LLM
agents, including GPT-4.1, are highly vulnerable to STAC, with attack success
rates (ASR) exceeding 90% in most cases. The core design of STAC's automated
framework is a closed-loop pipeline that synthesizes executable multi-step tool
chains, validates them through in-environment execution, and reverse-engineers
stealthy multi-turn prompts that reliably induce agents to execute the verified
malicious sequence. We further perform defense analysis against STAC and find
that existing prompt-based defenses provide limited protection. To address this
gap, we propose a new reasoning-driven defense prompt that achieves far
stronger protection, cutting ASR by up to 28.8%. These results highlight a
crucial gap: defending tool-enabled agents requires reasoning over entire
action sequences and their cumulative effects, rather than evaluating isolated
prompts or responses.