Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Decentralized Learning: Theoretical Optimality and Practical Improvements
Authors: Yucheng Lu, Christopher De Sa
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. We substantiate our theory and show DeTAG converges faster on unshuffled data and in sparse networks. Furthermore, we study a DeTAG variant, DeTAG*, that practically speeds up data-center-scale model training. |
| Researcher Affiliation | Academia | Yucheng Lu (EMAIL), Department of Computer Science, Cornell University, Ithaca, NY 14850, USA; Christopher De Sa (EMAIL), Department of Computer Science, Cornell University, Ithaca, NY 14850, USA |
| Pseudocode | Yes | Algorithm 1: Decentralized Stochastic Gradient Descent with Factorized Consensus Matrices (DeFacto) on worker i; Algorithm 2: Decentralized Stochastic Gradient Tracking with By-Phase Accelerated Gossip (DeTAG) on worker i; Algorithm 3: Accelerated Gossip (AG) with R steps; Algorithm 4: DeTAG with Momentum Acceleration (DeTAGM) on worker i |
| Open Source Code | No | No explicit statement about code release or links to repositories for the methodology described in this paper were found. |
| Open Datasets | Yes | Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. |
| Dataset Splits | Yes | To create the decentralized data, we first sort all the data points based on their labels, shuffle the first X% of data points, and then evenly split them across different workers. The X controls the degree of decentralization; we test X = 0, 25, 50, 100 and plot the results in Figure 2. |
| Hardware Specification | No | We use an 8-GPU ring graph there and use each GPU as an individual worker. Subsequently, in Subsection 7.2, we extend the system to a 32-GPU ring graph and train ResNet18 on ImageNet. |
| Software Dependencies | No | No specific software dependencies with version numbers were mentioned in the paper. |
| Experiment Setup | Yes | In the experiment of training LeNet on CIFAR10, we tune the step size using grid search inside the following range: {5e-3, 1e-3, 5e-4, 2.5e-4, 1e-4, 5e-5}. ... For DeTAG, we further tune the accelerated gossip parameter η within {0, 0.1, 0.2, 0.4} and phase length R within {1, 2, 3}. We fix the momentum term to be 0.9 and weight decay to be 1e-4. ... The hyperparameters adopted for each run are shown in Table 3 and Table 4. |
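The Dataset Splits row describes how the paper constructs decentralized data: sort by label, shuffle the first X% of points, then split evenly across workers. A minimal sketch of that procedure, where the function name, argument names, and list-based representation are all illustrative assumptions rather than the paper's code:

```python
import random

def decentralize(data, labels, num_workers, shuffle_pct):
    """Sort points by label, shuffle the first shuffle_pct% of the
    sorted order, then split evenly across workers. shuffle_pct = 0
    gives fully label-sorted (most decentralized) shards;
    shuffle_pct = 100 gives fully shuffled (i.i.d.) shards."""
    # Stable sort of indices by label.
    order = sorted(range(len(data)), key=lambda i: labels[i])
    # Shuffle only the first X% of the sorted indices.
    cut = int(len(order) * shuffle_pct / 100)
    head = order[:cut]
    random.shuffle(head)
    order = head + order[cut:]
    # Split evenly into one shard per worker (drops any remainder).
    per = len(order) // num_workers
    return [[data[i] for i in order[w * per:(w + 1) * per]]
            for w in range(num_workers)]
```

With X = 0 each worker receives a contiguous, label-sorted shard, matching the paper's most decentralized setting.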
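The pseudocode row lists Accelerated Gossip (AG) with R steps as Algorithm 3, and the setup row tunes its parameter η. A sketch of the usual accelerated-gossip structure, a Chebyshev-style momentum on repeated mixing with a doubly stochastic gossip matrix W; this is an assumption about the algorithm's general form, not a transcription of the paper's Algorithm 3:

```python
import numpy as np

def accelerated_gossip(x, W, eta, R):
    """Run R rounds of accelerated gossip on node values x.
    Each round mixes with the gossip matrix W and adds a momentum
    term weighted by eta; eta = 0 recovers plain gossip x <- W x."""
    x_prev = x.copy()
    for _ in range(R):
        # Mix with neighbors, then push past the plain-gossip update
        # by eta times the step away from the previous iterate.
        x, x_prev = (1 + eta) * (W @ x) - eta * x_prev, x
    return x
```

Because W is doubly stochastic, the network average is preserved at every round; the momentum only speeds up how fast individual nodes contract toward it, which is why η can be tuned independently of the phase length R.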