Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
2510.07205v1
cs.LG, math.OC
2025-10-10
Авторы:
Fangshuo Liao, Anastasios Kyrillidis
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of
modern AI systems. In particular, MoEs route inputs dynamically to specialized
experts whose outputs are aggregated through weighted summation. Despite their
widespread application, theoretical understanding of MoE training dynamics
remains limited to either separate expert-router optimization or only top-1
routing scenarios with carefully constructed datasets. This paper advances MoE
theory by providing convergence guarantees for joint training of soft-routed
MoE models with non-linear routers and experts in a student-teacher framework.
We prove that, with moderate over-parameterization, the student network
undergoes a feature learning phase, where the router's learning process is
``guided'' by the experts, that recovers the teacher's parameters. Moreover, we
show that a post-training pruning can effectively eliminate redundant neurons,
followed by a provably convergent fine-tuning process that reaches global
optimality. To our knowledge, our analysis is the first to bring novel insights
in understanding the optimization landscape of the MoE architecture.
Ссылки и действия
Дополнительные ресурсы: