HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission
2510.19470v1
cs.DC, cs.AI, cs.LG
2025-10-24
Authors:
Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, Qiang Wang
Abstract
Mixture-of-Experts (MoE) has become a popular architecture for scaling large
models. However, model scale is growing faster than a single datacenter (DC)
can accommodate, driving a shift toward a more flexible, cross-DC training
paradigm. Under this paradigm, Expert Parallelism (EP) of MoE faces significant
scalability issues due
to the limited cross-DC bandwidth. Specifically, existing EP optimizations
attempt to overlap data communication and computation, which has little benefit
in low-bandwidth scenarios due to a much longer data communication time.
Therefore, cross-DC EP scaling is fast becoming a critical
roadblock to the continued growth of MoE models.
To address this, we propose HybridEP, a modeling-guided framework to optimize
EP under constrained bandwidth. Our key idea is to dynamically transform the
spatial placement of experts to reduce data communication traffic and
frequency, thereby minimizing EP's communication overheads. However, finding
the optimal solution is non-trivial, because this transformation complicates
the original communication pattern by mixing data and expert communication. We therefore
build a stream-based model to determine the optimal transmission ratio. Guided
by this, we incorporate two techniques: (1) domain-based partition to construct
the mapping between hybrid patterns and specific communication topology at GPU
level, and (2) parameter-efficient migration to further refine this topology by
reducing expert transmission overhead and enlarging the domain size. Combining
all these designs, HybridEP can be viewed as a more general form of EP with better
scalability. Experimental results show that HybridEP outperforms existing
state-of-the-art MoE training systems by up to 5.6x under constrained
bandwidth. We further compare HybridEP and EP on large-scale simulations.
HybridEP achieves up to 1.45x speedup with 1k DCs under different bandwidths.
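
To make the trade-off between data transmission and expert transmission concrete, the following is a minimal back-of-the-envelope sketch. It is not the paper's stream-based model. It assumes a single forward pass through one expert, an expert made of two weight matrices (hidden x ffn_dim and ffn_dim x hidden), and 2 bytes per element. The function names and the example sizes are hypothetical and chosen only for illustration.

```python
# Toy comparison (an assumption-laden sketch, not HybridEP's actual model):
# should tokens travel to a remote expert, or should the expert's weights
# travel to the tokens?

def data_bytes(tokens: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Bytes moved when tokens are sent to a remote expert and the results
    are sent back (one dispatch plus one combine)."""
    return 2 * tokens * hidden * bytes_per_elem


def expert_bytes(hidden: int, ffn_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes moved when the expert's weights are copied instead, assuming
    two matrices of shape (hidden, ffn_dim) and (ffn_dim, hidden)."""
    return 2 * hidden * ffn_dim * bytes_per_elem


def cheaper_choice(tokens: int, hidden: int, ffn_dim: int) -> str:
    """Return 'expert' if moving the weights costs fewer bytes, else 'data'."""
    if expert_bytes(hidden, ffn_dim) < data_bytes(tokens, hidden):
        return "expert"
    return "data"


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    for tokens in (1024, 65536):
        print(tokens, cheaper_choice(tokens, hidden=4096, ffn_dim=14336))
```

Under these assumptions the break-even point is simply tokens = ffn_dim: once more tokens than that are routed to a remote expert, moving the expert is cheaper in raw bytes. A real system would also have to account for backward-pass traffic, gradient synchronization of migrated experts, and overlap with computation, which this sketch ignores.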