Why Do We Need Warm-up? A Theoretical Perspective
2510.03164v1
cs.LG, math.OC, stat.ML
2025-10-07
Авторы:
Foivos Alimisis, Rustem Islamov, Aurelien Lucchi
Abstract
Learning rate warm-up - increasing the learning rate at the beginning of
training - has become a ubiquitous heuristic in modern deep learning, yet its
theoretical foundations remain poorly understood. In this work, we provide a
principled explanation for why warm-up improves training. We rely on a
generalization of the $(L_0, L_1)$-smoothness condition, which bounds local
curvature as a linear function of the loss sub-optimality and exhibits
desirable closure properties. We demonstrate both theoretically and empirically
that this condition holds for common neural architectures trained with
mean-squared error and cross-entropy losses. Under this assumption, we prove
that Gradient Descent with a warm-up schedule achieves faster convergence than
with a fixed step-size, establishing upper and lower complexity bounds.
Finally, we validate our theoretical insights through experiments on language
and vision models, confirming the practical benefits of warm-up schedules.