Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
2511.02062v1
cs.DB, cs.AI
2025-11-06
Authors:
Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song, Ken Birman
Abstract
There is growing interest in deploying ML inference and knowledge retrieval
as services that could support both interactive queries by end users and more
demanding request flows that arise from AIs integrated into end-user
applications and deployed as agents. Our central premise is that these latter
cases will bring service-level latency objectives (SLOs). Existing ML serving
platforms use batching to optimize for high throughput, exposing them to
unpredictable tail latencies. Vortex enables an SLO-first approach. For
identical tasks, Vortex's pipelines achieve significantly lower and more stable
latencies than TorchServe and Ray Serve over a wide range of workloads, often
enabling a given SLO target at more than twice the request rate. When RDMA is
available, the Vortex advantage is even more significant.