xLLM Technical Report
2510.14686v1
cs.DC, cs.AI
2025-10-18
Authors:
Tongxuan Liu, Tao Peng, Peijun Yang, Xiaoyang Zhao, Xiusheng Lu, Weizhe Huang, Zirui Liu, Xiaoyu Chen, Zhiwei Liang, Jun Xiong, Donghe Jin, Minchao Zhang, Jinrong Guo, Yingxu Deng, Xu Zhang, Xianzhe Dong, Siqi Wang, Siyu Wu, Yu Wu, Zihan Tang, Yuting Zeng, Yanshu Wang, Jinguang Liu, Meng Kang, Menxin Li, Yunlong Wang, Yiming Liu, Xiaolong Ma, Yifan Wang, Yichen Zhang, Jinrun Yin, Keyang Zheng, Jiawei Yin, Jun Zhang, Ziyue Wang, Xiaobo Lin, Liangyu Liu, Liwei Lan, Yang Liu, Chunhua Peng, Han Liu, Songcheng Ren, Xuezhu Wang, Yunheng Shen, Yi Wang, Guyue Liu, Hui Chen, Tong Yang, Hailong Yang, Jing Li, Guiguang Ding, Ke Zhang
Abstract
We introduce xLLM, an intelligent and efficient Large Language Model (LLM)
inference framework designed for high-performance, large-scale enterprise-grade
serving, with deep optimizations for diverse AI accelerators. To meet these
demands, xLLM adopts a novel decoupled service-engine architecture. At the
service layer, xLLM-Service features an intelligent scheduling module that
efficiently processes multimodal requests and co-locates online and offline
tasks through unified elastic scheduling to maximize cluster utilization. This
module also relies on a workload-adaptive dynamic Prefill-Decode (PD)
disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation
policy designed for multimodal inputs. Furthermore, it incorporates a
distributed architecture to provide global KV Cache management and robust
fault-tolerant capabilities for high availability. At the engine layer,
xLLM-Engine co-optimizes system and algorithm designs to fully saturate
computing resources. This is achieved through multi-layer execution pipeline
optimizations, an adaptive graph mode, and xTensor memory management.
xLLM-Engine also integrates algorithmic enhancements such as optimized
speculative decoding and dynamic expert parallel load balancing (EPLB), which
together substantially boost throughput and inference efficiency. Extensive evaluations
demonstrate that xLLM delivers significantly higher performance and resource
efficiency. Under identical time-per-output-token (TPOT) constraints, xLLM
achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with
Qwen-series models, while maintaining an average throughput of 1.7x that of
MindIE with DeepSeek-series models. The xLLM framework is publicly available at
https://github.com/jd-opensource/xllm and
https://github.com/jd-opensource/xllm-service.