FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access
2510.13724v1
cs.DC, cs.AI, cs.SE
2025-10-17
Авторы:
Aditya Tanikanti, Benoit Côté, Yanfei Guo, Le Chen, Nickolaus Saint, Ryan Chard, Ken Raffenetti, Rajeev Thakur, Thomas Uram, Ian Foster, Michael E. Papka, Venkatram Vishwanath
Abstract
We present the Federated Inference Resource Scheduling Toolkit (FIRST), a
framework enabling Inference-as-a-Service across distributed High-Performance
Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI
models, like Large Language Models (LLMs), on existing HPC infrastructure.
Leveraging Globus Auth and Globus Compute, the system allows researchers to run
parallel inference workloads via an OpenAI-compliant API on private, secure
environments. This cluster-agnostic API allows requests to be distributed
across federated clusters, targeting numerous hosted models. FIRST supports
multiple inference backends (e.g., vLLM), auto-scales resources, maintains
"hot" nodes for low-latency execution, and offers both high-throughput batch
and interactive modes. The framework addresses the growing demand for private,
secure, and scalable AI inference in scientific workflows, allowing researchers
to generate billions of tokens daily on-premises without relying on commercial
cloud infrastructure.
Ссылки и действия
Дополнительные ресурсы: