Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
2510.04327v1
cs.LG, stat.ML
2025-10-08
Authors:
Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin
Abstract
Choosing an appropriate learning rate remains a key challenge in scaling the
depth of modern deep networks. The classical maximal update parameterization
($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to
homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in
heterogeneous architectures where residual accumulation and convolutions
introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P
(AM-$\mu$P), which constrains not each individual layer but the network-wide
average one-step pre-activation second moment to a constant scale. Combined
with a residual-aware He fan-in initialization, which scales the variance of
residual-branch weights inversely with the number of blocks $K$
($\mathrm{Var}[W]=c/(K\cdot\mathrm{fan\text{-}in})$), AM-$\mu$P yields width-robust depth laws that
transfer consistently across depths. We prove that, for one- and
two-dimensional convolutional networks, the maximal-update learning rate
satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects
are constant-level as $N\gg k$. For standard residual networks with general
conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, with $L$ the
minimal depth. Empirical results across a range of depths confirm the $-3/2$
scaling law and enable zero-shot learning-rate transfer, providing a unified
and practical LR principle for convolutional and deep residual networks without
additional tuning overhead.
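
As a rough illustration of the two rules stated in the abstract, the sketch below rescales a learning rate tuned at one depth to another depth under the $L^{-3/2}$ law, and computes the standard deviation implied by $\mathrm{Var}[W]=c/(K\cdot\mathrm{fan\text{-}in})$. The function names `transfer_lr` and `residual_he_std` are hypothetical, and the default `c = 2.0` is an assumption (the usual He constant for ReLU), not a value taken from the paper.

```python
import math


def transfer_lr(base_lr: float, base_depth: int, target_depth: int) -> float:
    """Rescale a tuned learning rate to a new depth, assuming eta*(L) ∝ L^(-3/2)."""
    if base_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return base_lr * (target_depth / base_depth) ** -1.5


def residual_he_std(fan_in: int, num_blocks: int, c: float = 2.0) -> float:
    """Std of residual-branch weights for Var[W] = c / (K * fan_in).

    c = 2.0 is an assumed default (standard He constant for ReLU).
    """
    if fan_in <= 0 or num_blocks <= 0:
        raise ValueError("fan_in and num_blocks must be positive")
    return math.sqrt(c / (num_blocks * fan_in))


# Example: a learning rate tuned at depth 8, transferred to depth 32.
lr_deep = transfer_lr(base_lr=0.1, base_depth=8, target_depth=32)

# Example: init scale for a residual branch with fan-in 64 in an 8-block network.
std = residual_he_std(fan_in=64, num_blocks=8)
```

Under these assumptions, quadrupling the depth divides the learning rate by $4^{3/2}=8$, which is the sense in which a rate tuned on a shallow network could be reused on a deeper one without a new sweep.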