Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps
2510.12744v1
stat.ML, cs.LG, math.ST, stat.CO, stat.ME, stat.TH
2025-10-16
Authors:
Do Tien Hai, Trung Nguyen Mai, TrungTin Nguyen, Nhat Ho, Binh T. Nguyen, Christopher Drovandi
Abstract
We develop a unified statistical framework for softmax-gated Gaussian mixture
of experts (SGMoE) that addresses three long-standing obstacles in parameter
estimation and model selection: (i) non-identifiability of gating parameters up
to common translations, (ii) intrinsic gate-expert interactions that induce
coupled differential relations in the likelihood, and (iii) the tight
numerator-denominator coupling in the softmax-induced conditional density. Our
approach introduces Voronoi-type loss functions aligned with the gate-partition
geometry and establishes finite-sample convergence rates for the maximum
likelihood estimator (MLE). In over-specified models, we reveal a link between
the MLE's convergence rate and the solvability of an associated system of
polynomial equations characterizing near-nonidentifiable directions. For model
selection, we adapt dendrograms of mixing measures to SGMoE, yielding a
consistent, sweep-free selector of the number of experts that attains
pointwise-optimal parameter rates under overfitting while avoiding multi-size
training. Simulations on synthetic data corroborate the theory, accurately
recovering the expert count and achieving the predicted rates for parameter
estimation while closely approximating the regression function. Under model
misspecification (e.g., $\epsilon$-contamination), the dendrogram selection
criterion is robust, recovering the true number of mixture components, while
the Akaike information criterion, the Bayesian information criterion, and the
integrated completed likelihood tend to overselect as the sample size grows. On a
maize proteomics dataset of drought-responsive traits, our dendrogram-guided
SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes
the likelihood early, and yields interpretable genotype-phenotype maps,
outperforming standard criteria without multi-size training.
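
Obstacle (i), non-identifiability of the gating parameters up to common translations, can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a scalar covariate, gates that are linear in the covariate, and Gaussian experts with linear means, none of which the abstract spells out. All function names and parameter values are hypothetical. Adding the same shift (t1, t0) to every gate's slope and intercept leaves the conditional density unchanged, because the common factor cancels between the softmax numerator and denominator.

```python
import math


def softmax_gate(x, betas1, betas0):
    """Softmax gating weights for a scalar covariate x.

    betas1[k] and betas0[k] are the (assumed linear) gate slope and
    intercept of expert k.
    """
    logits = [b1 * x + b0 for b1, b0 in zip(betas1, betas0)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def sgmoe_density(y, x, betas1, betas0, a, b, variances):
    """Conditional density p(y | x), assuming expert means a[k] * x + b[k]."""
    gates = softmax_gate(x, betas1, betas0)
    return sum(
        g * gaussian_pdf(y, ak * x + bk, v)
        for g, ak, bk, v in zip(gates, a, b, variances)
    )


# Hypothetical two-expert parameters, chosen only for illustration.
betas1 = [0.5, -1.0]
betas0 = [0.2, 0.3]
a = [1.0, -2.0]
b = [0.0, 1.0]
variances = [0.5, 1.5]

# Translate every gate by the same (t1, t0).
t1, t0 = 3.0, -0.7
shifted1 = [v + t1 for v in betas1]
shifted0 = [v + t0 for v in betas0]

p_original = sgmoe_density(0.4, 1.3, betas1, betas0, a, b, variances)
p_shifted = sgmoe_density(0.4, 1.3, shifted1, shifted0, a, b, variances)

# The two parameterizations define the same conditional density.
assert math.isclose(p_original, p_shifted, rel_tol=1e-9)
```

Because of this invariance, gating parameters can only be recovered up to a common translation, so any loss function that compares estimated and true gates has to account for the shift in some way, for example by fixing one gate's parameters to zero.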