NorMuon: Making Muon more efficient and scalable
2510.05491v1
cs.LG, cs.CL
2025-10-09
Authors:
Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao
Abstract
The choice of optimizer significantly impacts the training efficiency and
computational costs of large language models (LLMs). Recently, the Muon
optimizer has demonstrated promising results by orthogonalizing parameter
updates, improving optimization geometry through better conditioning. Despite
Muon's emergence as a candidate successor to Adam, the potential for jointly
leveraging their strengths has not been systematically explored. In this work,
we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an
optimizer that synergistically combines orthogonalization with neuron-level
adaptive learning rates. Our analysis reveals that while Muon effectively
reduces condition numbers, the resulting updates exhibit highly non-uniform
neuron norms, causing certain neurons to dominate the optimization process.
NorMuon addresses this imbalance by maintaining second-order momentum
statistics for each neuron and applying row-wise normalization after
orthogonalization, ensuring balanced parameter utilization while preserving
Muon's conditioning benefits. To enable practical deployment at scale, we
develop an efficient distributed implementation under the FSDP2 framework that
strategically distributes orthogonalization computations across devices.
Experiments across multiple model scales demonstrate that NorMuon consistently
outperforms both Adam and Muon, achieving 21.74% better training efficiency
than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while
maintaining a comparable memory footprint to Muon. Our findings suggest that
orthogonalization and adaptive learning rates are complementary rather than
competing approaches, opening new avenues for optimizer design in large-scale
deep learning.
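
The update described in the abstract (momentum, orthogonalization, then row-wise normalization using per-neuron second-moment statistics) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the hyperparameter values (`lr`, `beta1`, `beta2`, `eps`), and the exact form of the moving averages are hypothetical, and an exact SVD-based orthogonalization is swapped in for the iterative approximation that Muon uses in practice. Bias correction and any learning-rate rescaling are omitted.

```python
import numpy as np

def orthogonalize(m):
    # Exact orthogonal (polar) factor U @ V^T via SVD.
    # Assumption: stands in for Muon's cheaper iterative approximation.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def normuon_step(w, grad, momentum, row_second_moment,
                 lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    # All hyperparameter defaults here are illustrative assumptions.
    # First-order momentum over gradients.
    momentum = beta1 * momentum + (1.0 - beta1) * grad
    # Orthogonalize the momentum matrix (the Muon-style step).
    o = orthogonalize(momentum)
    # One second-moment statistic per neuron (row): mean squared entry.
    row_sq = np.mean(o ** 2, axis=1, keepdims=True)
    row_second_moment = beta2 * row_second_moment + (1.0 - beta2) * row_sq
    # Row-wise normalization applied after orthogonalization.
    o_hat = o / (np.sqrt(row_second_moment) + eps)
    w = w - lr * o_hat
    return w, momentum, row_second_moment

# Usage on a tall weight matrix (8 neurons, 4 inputs each).
rng = np.random.default_rng(0)
w0 = rng.standard_normal((8, 4))
g = rng.standard_normal((8, 4))
m0 = np.zeros_like(w0)
v0 = np.zeros((8, 1))

o_raw = orthogonalize(g)
raw_row_norms = np.linalg.norm(o_raw, axis=1)      # differ from row to row

w1, m1, v1 = normuon_step(w0, g, m0, v0)
update_row_norms = np.linalg.norm(w0 - w1, axis=1)  # balanced across rows
```

The sketch stores only one extra scalar per row beyond the momentum buffer, which is consistent with the abstract's claim of a memory footprint comparable to Muon. For a tall matrix the orthogonalized update has orthonormal columns but rows of unequal norm, which is the imbalance the row-wise normalization targets.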