Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares
2510.17506v1
cs.LG, math.OC
2025-10-22
Авторы:
Lachlan Ewen MacDonald, Hancheng Min, Leandro Palma, Salma Tarmoun, Ziqing Xu, René Vidal
Abstract
Classical optimisation theory guarantees monotonic objective decrease for
gradient descent (GD) when employed in a small step size, or ``stable", regime.
In contrast, gradient descent on neural networks is frequently performed in a
large step size regime called the ``edge of stability", in which the objective
decreases non-monotonically with an observed implicit bias towards flat minima.
In this paper, we take a step toward quantifying this phenomenon by providing
convergence rates for gradient descent with large learning rates in an
overparametrised least squares setting. The key insight behind our analysis is
that, as a consequence of overparametrisation, the set of global minimisers
forms a Riemannian manifold $M$, which enables the decomposition of the GD
dynamics into components parallel and orthogonal to $M$. The parallel component
corresponds to Riemannian gradient descent on the objective sharpness, while
the orthogonal component is a bifurcating dynamical system. This insight allows
us to derive convergence rates in three regimes characterised by the learning
rate size: (a) the subcritical regime, in which transient instability is
overcome in finite time before linear convergence to a suboptimally flat global
minimum; (b) the critical regime, in which instability persists for all time
with a power-law convergence toward the optimally flat global minimum; and (c)
the supercritical regime, in which instability persists for all time with
linear convergence to an orbit of period two centred on the optimally flat
global minimum.
Ссылки и действия
Дополнительные ресурсы: