Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
2510.15262v1
cs.LG, cs.AI, stat.ML
2025-10-21
Авторы:
Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
Abstract
Empirical scaling laws prescribe how to allocate parameters, data, and
compute, while maximal-update parameterization ($\mu$P) enables learning-rate
transfer across widths by equalizing early-time update magnitudes. However, in
modern scale-invariant architectures, training quickly enters an
optimizer-governed steady state where normalization layers create backward
scale sensitivity and the effective learning rate becomes width dependent,
degrading $\mu$P transfer. We address this by introducing a weight-decay
scaling rule for AdamW that preserves sublayer gain across widths. Empirically,
the singular-value spectrum of each matrix parameter scales in norm as
$\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width
scaling $d$, we observe that the top singular value scales approximately as
$\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P
learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an
empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that
approximately keeps sublayer gains width invariant. Together with vector-like
parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields
\emph{zero-shot} transfer of both learning rate and weight decay from proxy to
target widths, removing per-width sweeps. We validate the rule on LLaMA-style
Transformers and in a minimal synthetic setting, and we provide a simple
diagnostic, matching top singular values, to check sublayer-gain invariance.
Our results extend $\mu$P beyond the near-init regime by explicitly controlling
steady-state scales set by the optimizer, offering a practical recipe for
width-robust hyperparameter transfer under AdamW.
Ссылки и действия
Дополнительные ресурсы: