Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization
Authors: Wei Liu, Anweshit Panda, Ujwal Pandey, Christopher Brissette, Yikang Shen, George Slota, Naigang Wang, Jie Chen, Yangyang Xu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Superior empirical performance is observed over state-of-the-art methods on training deep neural networks (DNNs) and Transformers. ... We now demonstrate the efficacy of the proposed algorithms over a set of numerical experiments. We consider three standard benchmarks, including training a convolutional neural network LeNet5 (LeCun et al., 1998) on the Fashion MNIST dataset (Xiao et al., 2017), a ResNet architecture Fixup-ResNet-20 (Zhang et al., 2019) on the CIFAR-10 dataset (Krizhevsky et al., 2009), and a small-scale GPT model, called NanoGPT (Andrej, 2022), on the tiny-shakespeare dataset. |
| Researcher Affiliation | Collaboration | Wei Liu EMAIL Department of Mathematical Sciences Rensselaer Polytechnic Institute ... Yikang Shen EMAIL MIT-IBM Watson AI Lab, IBM Research ... Naigang Wang EMAIL IBM T. J. Watson Research Center ... Jie Chen EMAIL MIT-IBM Watson AI Lab, IBM Research |
| Pseudocode | Yes | The pseudocode is shown in Alg. 1. Lines 5-6 follow AMSGrad and perform local updates to the first and second momentum; Line 7 performs a local update to the model; x̂_i is used to estimate the local model, while we compress the estimate error; Line 8 performs a neighbor communication, which can be realized through communicating the compressed vectors; see the discussions above Assumption 3. ... Algorithm 1: Decentralized AMSGrad with Compressed Communication (DAMSCo) ... Algorithm 2: Decentralized Stochastic Heavy-ball Method with Compressed Communication (DaSHCo) |
| Open Source Code | Yes | Our code is available for download at the following repository: https://github.com/DecentralizedMethods/DAMSCo_DaSHCo. |
| Open Datasets | Yes | We consider three standard benchmarks, including training a convolutional neural network LeNet5 (LeCun et al., 1998) on the Fashion MNIST dataset (Xiao et al., 2017), a ResNet architecture Fixup-ResNet-20 (Zhang et al., 2019) on the CIFAR-10 dataset (Krizhevsky et al., 2009), and a small-scale GPT model, called NanoGPT (Andrej, 2022), on the tiny-shakespeare dataset. |
| Dataset Splits | Yes | We measure and report objective loss on the full training data, accuracy on the full test data, and consensus error calculated as (1/n) Σᵢ ‖xᵢ − x̄‖². ... Next, we compare all methods on heterogeneous training data, with an equal number (i.e., 2) of label classes from each dataset distributed to each of the 5 MPI ranks. |
| Hardware Specification | Yes | Our methods and the methods for comparison are implemented in Python with PyTorch and MPI for Python (mpi4py) and they will be open-sourced upon publication. For LeNet5 and Fixup-ResNet-20, we run our experiments on a CPU server. This server has two-way 64-core (256 threads) AMD EPYC 7742 CPUs at 2.25GHz and 2TB DDR4 memory. It runs Ubuntu 20.04 with PyTorch version 2.3.0+cu121, Python 3.8.10, and mpi4py version 3.0.3. For NanoGPT, we run the experiments on 4 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | Our methods and the methods for comparison are implemented in Python with PyTorch and MPI for Python (mpi4py)... It runs Ubuntu 20.04 with PyTorch version 2.3.0+cu121, Python 3.8.10, and mpi4py version 3.0.3. |
| Experiment Setup | Yes | We train LeNet5 on the Fashion MNIST dataset for 100 communication rounds for all optimizers. For CDProxSGT, we use the code published by the authors along with the same learning rate (0.02), batch size (8), and µ value for the regularizer (10⁻⁴), as tuned by the authors. We mirror these hyperparameter settings for DaSHCo. For DADAM and DAMSCo, we mirror the batch size of 8, set the learning rate to the PyTorch default for Adam of 0.001, and similarly use the standard β defaults of β1 = 0.9, β2 = 0.999. For DAdaGrad, we use the same batch size and the PyTorch default learning rate of 0.01. ... For Fashion MNIST, we use top-k(0.3), communicating the largest 30% of values. For CIFAR-10, we use top-k(0.4). |
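The top-k operator referenced in the Experiment Setup row (e.g., top-k(0.3) communicates the largest 30% of values) can be sketched as follows. This is a generic top-k sparsifier, not code from the paper's repository; the function name and NumPy implementation are illustrative.

```python
import numpy as np

def top_k_compress(v: np.ndarray, ratio: float) -> np.ndarray:
    """Keep only the fraction `ratio` of entries of v with the largest
    magnitudes, zeroing the rest -- a sketch of the top-k compressor
    used for compressed communication (e.g., ratio=0.3 keeps 30%)."""
    k = max(1, int(ratio * v.size))
    out = np.zeros_like(v)
    # indices of the k largest-magnitude entries
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out
```

In compressed decentralized methods, this operator is typically applied to the difference between the local model and its running estimate, so only a sparse correction is communicated each round.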
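The Pseudocode row describes Lines 5-7 of Alg. 1 as AMSGrad-style momentum updates followed by a local model update. A minimal sketch of that recursion, assuming the standard AMSGrad form (the paper's decentralized variant additionally involves neighbor mixing, which is omitted here):

```python
import numpy as np

def amsgrad_local_step(x, m, v, v_hat, g,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style local step: first/second momentum updates
    (cf. Lines 5-6 of Alg. 1), then a local model update (cf. Line 7).
    Standard AMSGrad recursion; consensus/communication terms omitted."""
    m = beta1 * m + (1 - beta1) * g          # first momentum
    v = beta2 * v + (1 - beta2) * g * g      # second momentum
    v_hat = np.maximum(v_hat, v)             # AMSGrad max-tracking of v
    x = x - lr * m / (np.sqrt(v_hat) + eps)  # local model update
    return x, m, v, v_hat
```

The max-tracked `v_hat` is what distinguishes AMSGrad from Adam: the effective step size is non-increasing per coordinate.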
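The Dataset Splits row describes the heterogeneous setting: 2 label classes from each 10-class dataset assigned to each of 5 MPI ranks. A sketch of such a class-disjoint partition; the contiguous class-to-rank assignment below is an illustrative choice, not necessarily the paper's.

```python
def heterogeneous_split(labels, n_ranks=5, classes_per_rank=2):
    """Partition sample indices so each rank holds samples from a
    disjoint set of `classes_per_rank` label classes (e.g., 2 of the
    10 classes per rank across 5 ranks). Assignment order is illustrative."""
    shards = []
    for r in range(n_ranks):
        cls = set(range(r * classes_per_rank, (r + 1) * classes_per_rank))
        shards.append([i for i, y in enumerate(labels) if y in cls])
    return shards
```

Each rank then trains only on its shard, which makes the local objectives heterogeneous and stresses the consensus mechanism of the decentralized methods.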