Distributed Low-Communication Training with Decoupled Momentum Optimization

2510.03371v1 cs.LG, cs.AI, cs.DC 2025-10-08

Authors:

Sasho Nedelkoski, Alexander Acker, Odej Kao, Soeren Becker, Dominik Scheinert

Abstract

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.
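The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the momentum is a flat list of floats, that "high-frequency" means DCT coefficients at or above a hypothetical index `cutoff`, and that synchronization is a plain average across replicas. The names `split_momentum`, `sync_high`, and `cutoff` are illustrative and do not come from the paper.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def split_momentum(momentum, cutoff):
    """Split a momentum vector into (low, high) frequency parts.

    `cutoff` is a hypothetical frequency index: coefficients below it
    count as low-frequency, the rest as high-frequency.
    """
    coeffs = dct(momentum)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high


def sync_high(momenta, cutoff):
    """Average only the high-frequency part across replicas.

    Each replica keeps its own low-frequency part (an assumption of this
    sketch); only the high-frequency part would need to be communicated.
    """
    parts = [split_momentum(m, cutoff) for m in momenta]
    n = len(momenta[0])
    avg_high = [sum(high[i] for _, high in parts) / len(parts) for i in range(n)]
    return [[low[i] + avg_high[i] for i in range(n)] for low, _ in parts]
```

In a training loop, a function like `sync_high` would be called only when the step count is a multiple of $H$, the synchronization interval named in the abstract. Between synchronizations each replica would update its momentum locally.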


