How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
2510.22980v1
cs.LG, stat.ML
2025-10-29
Authors:
Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, Christos Thrampoulidis
Abstract
The growing adoption of spectrum-aware matrix-valued optimizers such as Muon
and Shampoo in deep learning motivates a systematic study of their
generalization properties and, in particular, when they might outperform
competitive algorithms. We approach this question by introducing appropriate
simplifying abstractions as follows: First, we use imbalanced data as a
testbed. Second, we study the canonical form of such optimizers, which is
Spectral Gradient Descent (SpecGD): each update step is $UV^T$, where
$U\Sigma V^T$ is the truncated SVD of the gradient. Third, within this framework we
identify a canonical setting for which we precisely quantify when SpecGD
outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both
linear and bilinear models, we show that unlike GD, which prioritizes learning
dominant principal components of the data first, SpecGD learns all principal
components of the data at equal rates. We demonstrate how this translates to a
growing gap in balanced accuracy favoring SpecGD early in training and further
show that the gap remains consistent even when the GD counterpart uses adaptive
step-sizes via normalization. By extending the analysis to deep linear models,
we show that depth amplifies these effects. We empirically verify our
theoretical findings on a variety of imbalanced datasets. Our experiments
compare practical variants of spectral methods, like Muon and Shampoo, against
their Euclidean counterparts and Adam. The results validate our findings that
these spectral optimizers achieve superior generalization by promoting a more
balanced learning of the data's underlying components.
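As an illustration of the SpecGD update described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. The function names, the relative tolerance used to truncate the SVD, and the toy gradient are all assumptions made for this example. It compares the singular values of a plain GD update direction (the gradient itself) with those of the SpecGD direction $UV^T$.

```python
import numpy as np


def specgd_direction(grad, rel_tol=1e-10):
    """Return U V^T, where U S V^T is the truncated SVD of `grad`.

    Truncation rule (an assumption for this sketch): drop singular
    directions whose singular value is below rel_tol * largest value.
    """
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if S.size == 0 or S[0] == 0.0:
        return np.zeros_like(grad)
    keep = S > rel_tol * S[0]
    return U[:, keep] @ Vt[keep, :]


def specgd_step(W, grad, lr):
    """One SpecGD step: every retained singular direction moves by lr."""
    return W - lr * specgd_direction(grad)


def gd_step(W, grad, lr):
    """One Euclidean GD step: directions move in proportion to S."""
    return W - lr * grad


# Hypothetical rank-2 gradient with one dominant and one weak component.
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 2)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 2)))
grad = Q1 @ np.diag([10.0, 0.1]) @ Q2.T

gd_sv = np.linalg.svd(grad, compute_uv=False)
spec_sv = np.linalg.svd(specgd_direction(grad), compute_uv=False)

print("GD update singular values:    ", np.round(gd_sv, 4))
print("SpecGD update singular values:", np.round(spec_sv, 4))
```

In this toy case the GD direction is dominated by the large component, while the SpecGD direction weights both retained components equally, which mirrors the "equal rates" behavior stated above. Practical optimizers such as Muon and Shampoo are variants of this idea and are not reproduced by this sketch.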