Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks
2510.11354v1
cs.LG, cs.AI, stat.ML
2025-10-15
Authors:
Xuan Tang, Han Zhang, Yuan Cao, Difan Zou
Abstract
Adam is a widely used adaptive gradient method in deep learning that has also
received considerable attention in theoretical research. However, most
existing theoretical work primarily analyzes its full-batch version, which
differs fundamentally from the stochastic variant used in practice. Unlike SGD,
stochastic Adam does not converge to its full-batch counterpart even with
infinitesimal learning rates. We present the first theoretical characterization
of how batch size affects Adam's generalization, analyzing two-layer
over-parameterized CNNs on image data. Our results reveal that while both Adam
and AdamW with proper weight decay $\lambda$ converge to solutions with poor
test error, their mini-batch variants can achieve near-zero test error. We
further prove Adam has a strictly smaller effective weight decay bound than
AdamW, theoretically explaining why Adam requires more sensitive $\lambda$
tuning. Extensive experiments validate our findings, demonstrating the critical
role of batch size and weight decay in Adam's generalization performance.
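
The claim that stochastic Adam need not track its full-batch counterpart can be illustrated with a toy calculation. The sketch below is not the paper's analysis. It assumes the sign-descent simplification of Adam (momentum parameters set to zero and a negligible stability constant, so each step is the sign of the batch gradient) and uses made-up per-example gradients for a single scalar weight. All names and values are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


# Hypothetical per-example gradients of one scalar weight (illustrative values).
per_example_grads = [3.0, -1.0, -1.0]

# Full batch: average the gradients first, then take the sign.
full_batch_direction = sign(mean(per_example_grads))

# Batch size 1: take the sign of each single-example gradient,
# then average over the equally likely choices of example.
expected_minibatch_direction = mean(sign(g) for g in per_example_grads)

# SGD for comparison: the expected single-example gradient
# equals the full-batch gradient, since averaging is linear.
expected_sgd_gradient = mean(per_example_grads)
full_gradient = mean(per_example_grads)

print("full-batch sign direction:     ", full_batch_direction)
print("expected batch-size-1 direction:", expected_minibatch_direction)
print("SGD expected vs full gradient: ", expected_sgd_gradient, full_gradient)
```

In this toy setting the full-batch sign step is positive while the expected batch-size-1 sign step is negative. Shrinking the learning rate scales both steps but does not reconcile their directions. The SGD expectation, by contrast, matches the full gradient exactly.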