Justitia: Fair and Efficient Scheduling for LLM Applications
2510.17015v1
cs.LG, cs.AI, cs.DC
2025-10-22
Авторы:
Mingyan Yang, Guanjie Wang, Manqi Luo, Yifei Liu, Chen Chen, Han Zhao, Yu Feng, Quan Chen, Minyi Guo
Abstract
In the era of Large Language Models (LLMs), it has been popular to launch a
series of LLM inferences -- we call an LLM application -- to better solve
real-world problems. When serving those applications in shared GPU servers, the
schedulers are expected to attain fast application completions with guaranteed
worst-case performance. However, mainstream LLM schedulers fail to behave well
for LLM applications -- due to head-of-line blocking or over-constrained
resource allocation. In this paper, we propose to serve LLM applications in a
fair and also efficient manner. To this end, we design Justitia, a novel
scheduler with three key techniques. First, given that memory is prevalently a
bottleneck for mainstream inference frameworks like vLLM, Justitia models the
service cost of LLM applications in a memory-centric manner. Meanwhile, it uses
a simple neural network model to conduct light-weight and also accurate demand
prediction. Moreover, Justitia adopts a virtual-time based fair queuing
algorithm to reduce the overall performance with guaranteed worst-case delay.
We have implemented Justitia atop vLLM, and experimental results involving
diverse LLM applications show that it can substantially enhance the scheduling
efficiency with fairness preserved.
Ссылки и действия
Дополнительные ресурсы: