Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

2511.02062v1 cs.DB, cs.AI 2025-11-06

Authors:

Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song, Ken Birman

Abstract

There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
