MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
2510.19366v1
cs.CL, cs.LG
2025-10-24
Авторы:
Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
Abstract
Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI,
achieve high quality by sparsely activating parameters. However, their reliance
on routing between a few monolithic experts via a top-k mechanism creates a
"quality cliff", offering only a few coarse-grained operating points. This
inflexibility forces a difficult trade-off between cost and quality, preventing
adaptation to diverse Service Level Objectives (SLOs) and leading to
significant resource over-provisioning.
This paper introduces MoE-Prism, a model-system co-design that transforms
rigid MoE models into elastic services. Our methodology is divided into two
phases. First, an \emph{Offline Refactoring Engine} systematically deconstructs
monolithic experts into fine-grained "sub-experts." This engine employs a
partitioning optimization solver that uses a metaheuristic-based approach to
group neurons, preserving functional locality without requiring retraining.
Second, an \emph{Online Scheduling Engine} leverages this new elasticity
through QoS-aware scheduling. It implements specialized policies to solve
complex system problems, including maximizing throughput in cloud deployments
and managing latency-optimized offloading for memory-constrained devices. Our
evaluation across three different MoE models shows that MoE-Prismprovides over
4 times more distinct, stable operating points than the baseline. This allows
an AI service to dynamically improve throughput by up to 19.9\% under a strict
latency budget or reduce latency by up to 10.36\% under limited resources.
MoE-Prism provides the critical "control knob" to bridge the model-system gap,
enabling the next generation of adaptive, efficient, and QoS-aware AI services.
Ссылки и действия
Дополнительные ресурсы: