Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

2509.22832v1 cs.DC, cs.AI, cs.LG 2025-10-01

Authors:

Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

Abstract

Training large language models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion-parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight, sampling-based, hardware-aware prediction models for key operations; and (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors, 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200), for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
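The idea of combining operator-level decomposition with lightweight sampling can be illustrated with a toy sketch. This is not the authors' implementation: the operator names, the sample tables, the timing values, and the functions `interpolate` and `predict_step_time` are all hypothetical, and the sketch ignores communication and parallelism entirely. It only shows how a few sampled timings per operator might be interpolated and summed into a per-step estimate on a CPU.

```python
from bisect import bisect_left


def interpolate(samples, x):
    """Piecewise-linear interpolation over (size, seconds) samples sorted by size.

    Queries outside the sampled range are clamped to the nearest sample.
    """
    xs = [size for size, _ in samples]
    ys = [secs for _, secs in samples]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def predict_step_time(op_sizes, sample_tables, num_layers):
    """Sum interpolated per-operator times for one layer, then scale by layer count.

    Assumes every layer runs the same operators at the same sizes (a simplification).
    """
    per_layer = sum(
        interpolate(sample_tables[op], size) for op, size in op_sizes.items()
    )
    return per_layer * num_layers


if __name__ == "__main__":
    # Hypothetical sampled timings: (problem size, seconds). Placeholder values only.
    sample_tables = {
        "matmul": [(1024, 0.001), (2048, 0.004), (4096, 0.016)],
        "layernorm": [(1024, 0.0001), (4096, 0.0004)],
    }
    # Hypothetical operator sizes for one transformer layer.
    op_sizes = {"matmul": 3072, "layernorm": 2048}
    print(predict_step_time(op_sizes, sample_tables, num_layers=24))
```

A real predictor along the lines the abstract describes would also need models for communication and for each parallelism strategy; the sketch covers only the compute side under the stated assumptions.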


Related articles

Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems

Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over...

Towards Straggler-Resilient Split Federated Learning: An Unbalanced Update Appro...

HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Exp...

ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
