STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

2509.25624v1 cs.CR, cs.AI, cs.CL, cs.LG 2025-10-02

Authors:

Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi

Abstract

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.
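The abstract's closing point, that a defense must reason over the whole action sequence rather than over isolated calls, can be illustrated with a toy sketch. This is not the paper's defense (which is a reasoning-driven prompt) and not its pipeline. The tool names, the `ToolCall` type, and both check functions are hypothetical and exist only to show how a per-call filter can pass every step of a chain while a history-aware check flags the cumulative effect.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation by the agent (hypothetical schema)."""
    name: str
    target: str


# Hypothetical allow-list: each tool is considered benign when viewed alone.
ALLOWED_TOOLS = {"list_files", "create_backup", "delete_backup", "delete_file"}


def per_call_check(call: ToolCall) -> bool:
    """Isolated check: looks only at the current call."""
    return call.name in ALLOWED_TOOLS


def sequence_check(history: List[ToolCall]) -> bool:
    """History-aware check: reject deleting a file whose backup was
    removed earlier in the same session (an irreversible combined effect)."""
    removed_backups = set()
    for call in history:
        if call.name == "delete_backup":
            removed_backups.add(call.target)
        elif call.name == "create_backup":
            removed_backups.discard(call.target)
        elif call.name == "delete_file" and call.target in removed_backups:
            return False
    return True


# A chain in which every step looks harmless on its own.
chain = [
    ToolCall("list_files", "/data"),
    ToolCall("delete_backup", "report.txt"),
    ToolCall("delete_file", "report.txt"),
]

every_call_passes = all(per_call_check(c) for c in chain)
sequence_passes = sequence_check(chain)
```

In this sketch `every_call_passes` is true while `sequence_passes` is false, which is the gap the abstract describes. A real monitor would need a far richer model of environment state than a single hard-coded rule.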


