Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
2511.02230v1
cs.OS, cs.AI, cs.NI
2025-11-06
Authors:
Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Alvin Cheung, Joseph Gonzalez, Ion Stoica
Abstract
Agentic LLM applications interleave LLM generation requests with tool calls.
These tool calls break the continuity of the workflow by creating pauses
between LLM requests, bringing many challenges for the serving system,
especially under multi-turn scenarios. Each pause potentially causes KV cache
eviction and extra waiting time before entering the continuous batch for the
following LLM request. Since these pauses happen for each call, this problem
becomes increasingly severe as the number of turns grows in agentic programs. Previous
works either fail to incorporate information from the tool call, evicting the KV
cache and forcing repeated prefill or loading, or ignore the continuity of
a multi-turn program, creating waiting time between turns that increases
per-request latency.
We present Continuum, a serving system to optimize job completion time for
multi-turn agent workloads by combining tool-aware KV cache timeout with
program-level scheduling. By predicting tool call durations in agentic
workflows, Continuum selectively pins the KV cache in GPU memory with a
time-to-live value based on the total number of turns. When combined with program-level
first-come-first-served scheduling, Continuum prevents scheduling bubbles, preserves
multi-turn continuity, and optimizes throughput for complex agentic
workflows. By modeling the variability of tool calls and agent program
continuity, Continuum outperforms state-of-the-art baselines. Our evaluation on
real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models
shows that Continuum significantly improves average job completion time
and remains performant across different hardware setups and DRAM offloading
schemes. Preview code is available at:
https://github.com/Hanchenli/vllm-continuum
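
The two ideas named in the abstract, a time-to-live on pinned KV cache and program-level first-come-first-served ordering, can be illustrated with a toy sketch. This is not the Continuum implementation or the vLLM API. The class `ToyScheduler`, its method names, and the `max_pin_seconds` threshold are hypothetical. The pinning rule used here (pin only when the predicted tool call is shorter than a fixed cap, with the TTL set to the predicted duration) is a simplifying assumption, whereas the abstract states that the actual TTL also depends on the total number of turns.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Illustrative only: a hypothetical model, not Continuum's code."""
    max_pin_seconds: float = 5.0  # assumed cap on how long a pin may last
    program_arrival: dict = field(default_factory=dict)  # program_id -> first arrival time
    pins: dict = field(default_factory=dict)             # program_id -> pin expiry time

    def admit(self, program_id: str, now: float) -> None:
        # Record only the arrival time of the program's first request.
        self.program_arrival.setdefault(program_id, now)

    def on_tool_call(self, program_id: str, now: float, predicted_seconds: float) -> bool:
        # Assumed rule: pin the KV cache only when the predicted pause is short,
        # so holding GPU memory is likely cheaper than a later re-prefill.
        if predicted_seconds <= self.max_pin_seconds:
            self.pins[program_id] = now + predicted_seconds
            return True
        self.pins.pop(program_id, None)
        return False

    def expire(self, now: float) -> list:
        # Release pins whose time-to-live has elapsed.
        expired = sorted(p for p, t in self.pins.items() if t <= now)
        for p in expired:
            del self.pins[p]
        return expired

    def pick_next(self, waiting: list) -> str:
        # Program-level FCFS: order by when the program first arrived,
        # not by when its latest turn was submitted.
        return min(waiting, key=lambda p: self.program_arrival[p])


sched = ToyScheduler(max_pin_seconds=5.0)
sched.admit("A", now=0.0)
sched.admit("B", now=1.0)
sched.on_tool_call("A", now=2.0, predicted_seconds=3.0)   # short pause: pinned
sched.on_tool_call("B", now=2.0, predicted_seconds=30.0)  # long pause: not pinned
next_program = sched.pick_next(["B", "A"])  # program A arrived first
```

In this sketch a program returning from a tool call keeps its original place in the queue, which is one way to avoid the waiting time between turns that the abstract describes. A real system would also need to handle mispredicted tool durations and memory pressure.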