Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control
2510.20905v1
cs.LG, math.PR
2025-10-28
Authors:
Xingyu Wang, Chang-Han Rhee
Abstract
Stochastic gradient descent (SGD) and its variants enable modern artificial
intelligence. However, theoretical understanding lags far behind their
empirical success. It is widely believed that SGD has a curious ability to
avoid sharp local minima in the loss landscape, which are associated with poor
generalization. To unravel this mystery and further enhance this capability of
SGDs, it is imperative to go beyond traditional local convergence analysis
and obtain a comprehensive understanding of SGDs' global dynamics. In this
paper, we develop technical machinery based on the recent large
deviations and metastability analysis in Wang and Rhee (2023) and obtain a sharp
characterization of the global dynamics of heavy-tailed SGDs. In particular, we
reveal a fascinating phenomenon in deep learning: by injecting and then
truncating heavy-tailed noise during the training phase, SGD can almost
completely avoid sharp minima and achieve better generalization performance on
the test data. Simulation and deep learning experiments confirm our theoretical
prediction that heavy-tailed SGD with gradient clipping finds local minima with
flatter geometry and achieves better generalization performance.
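
To make the "inject, then truncate" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the heavy-tailed noise is added to the gradient and that the whole update is then rescaled so its norm never exceeds a threshold b. The function names (clip_norm, heavy_tailed_noise, truncated_heavy_tailed_sgd), the symmetrized Pareto-type noise, and the parameters eta, b, and alpha are illustrative assumptions that do not appear in the abstract.

```python
import numpy as np


def clip_norm(w, b):
    """Rescale w so its Euclidean norm is at most b, preserving direction."""
    norm = np.linalg.norm(w)
    return w if norm <= b else w * (b / norm)


def heavy_tailed_noise(rng, alpha, size):
    """Symmetric noise with power-law tails of index alpha (assumed noise model)."""
    magnitude = rng.pareto(alpha, size=size)  # heavy right tail
    sign = rng.choice([-1.0, 1.0], size=size)  # symmetrize around zero
    return sign * magnitude


def truncated_heavy_tailed_sgd(grad, x0, eta, b, alpha, n_steps, seed=0):
    """Run SGD with injected heavy-tailed noise and a norm-clipped update.

    grad    : callable returning the gradient at x
    eta     : step size
    b       : truncation threshold on the norm of each update
    alpha   : tail index of the injected noise (smaller means heavier tails)
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    path = [x.copy()]
    for _ in range(n_steps):
        z = heavy_tailed_noise(rng, alpha, x.shape)
        # Inject the noise into the gradient, then truncate the full update.
        step = clip_norm(-eta * (grad(x) + z), b)
        x = x + step
        path.append(x.copy())
    return np.array(path)


# Usage on a toy quadratic loss f(x) = ||x||^2 / 2, whose gradient is x.
path = truncated_heavy_tailed_sgd(
    grad=lambda x: x, x0=[1.0, -2.0], eta=0.1, b=0.5, alpha=1.5, n_steps=200
)
```

In this sketch no single update can move the iterate farther than b, so leaving a wide basin requires several large noise events in succession, while a narrow basin can be left after fewer. That is the intuition behind the abstract's claim about avoiding sharp minima; the sketch itself does not demonstrate it.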