Faster Convergence of Local SGD for Over-Parameterized Models

Authors: Tiancheng Qin, S. Rasoul Etesami, César A. Uribe

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we validate our theoretical results by performing large-scale numerical experiments that reveal the convergence behavior of Local SGD for practical over-parameterized deep learning models, in which the O(1/T) convergence rate of Local SGD is clearly shown.
Researcher Affiliation | Academia | Tiancheng Qin, Department of Industrial and Systems Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; S. Rasoul Etesami, Department of Industrial and Systems Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; César A. Uribe, Department of Electrical and Computer Engineering, Rice University.
Pseudocode | Yes | The pseudo-code for the Local SGD algorithm is provided in Algorithm 1.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | Yes | We distribute the CIFAR-10 dataset (Krizhevsky et al., 2009) to n = 20 nodes and apply Local SGD to train a ResNet18 neural network (He et al., 2016) on the CIFAR-10 dataset.
Dataset Splits | Yes | We first sort the data by their label, then divide the dataset into 20 shards and assign each of 20 nodes 1 shard. In this way, ten nodes will have image examples of one label, and ten nodes will have image examples of two labels. This regime leads to highly heterogeneous datasets among nodes. ... We partition the dataset in three different ways to reflect different data similarity regimes and evaluate the relationship between training loss, communication rounds, and local steps for Local SGD under each of the three regimes.
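The label-sorted sharding scheme quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy label vector, and the use of NumPy are all assumptions; only the sort-then-shard logic comes from the paper's description.

```python
import numpy as np

def shard_by_label(labels, n_nodes=20):
    """Sort example indices by class label, split them into n_nodes
    contiguous shards, and assign one shard per node. Because each
    shard covers a contiguous run of the sorted labels, every node
    sees only one or two classes (a highly non-IID partition)."""
    order = np.argsort(labels, kind="stable")   # indices sorted by label
    shards = np.array_split(order, n_nodes)     # one contiguous shard per node
    return {node: shard for node, shard in enumerate(shards)}

# Toy CIFAR-10-like label vector: 10 classes, 100 examples each.
labels = np.repeat(np.arange(10), 100)
parts = shard_by_label(labels, n_nodes=20)
# Each node's shard spans at most two distinct labels.
assert all(len(np.unique(labels[idx])) <= 2 for idx in parts.values())
```

With the real CIFAR-10 training set (50,000 examples, 5,000 per class) the same split yields 20 shards of 2,500 examples each, matching the heterogeneous regime the paper describes.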
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using Layer Normalization instead of Batch Normalization within the ResNet18 architecture, but does not specify any software libraries or packages with version numbers for reproducibility.
Experiment Setup | Yes | For this set of experiments, we run the Local SGD algorithm for R = 20000 communication rounds with a different number of local steps per communication round K = 1, 2, 5, 10, 20... We use a training batch size of 8 and choose stepsize η to be 0.1... We stop the algorithm after at most 10^6 communication rounds or if the training loss is below 10^-4. We choose stepsize η = 0.075.
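The structure referenced throughout these rows, R communication rounds, each consisting of K local SGD steps per node followed by model averaging, can be sketched as below. This is a hedged illustration of the generic Local SGD template (Algorithm 1 in the paper is not reproduced here): the quadratic objectives, the function names, and the toy hyperparameters are all assumptions made for a self-contained example.

```python
import numpy as np

def local_sgd(grads, x0, n_nodes=20, R=100, K=5, eta=0.1):
    """Generic Local SGD loop: each of n_nodes runs K local gradient
    steps with stepsize eta, then the server averages the local
    models; this repeats for R communication rounds."""
    x = np.array(x0, dtype=float)
    for _ in range(R):                      # R communication rounds
        local_models = []
        for i in range(n_nodes):
            xi = x.copy()
            for _ in range(K):              # K local steps on node i's data
                xi -= eta * grads[i](xi)
            local_models.append(xi)
        x = np.mean(local_models, axis=0)   # one communication: averaging
    return x

# Toy heterogeneous objectives f_i(x) = 0.5 * (x - c_i)^2, whose
# average is minimized at mean(c_i) = 0; Local SGD should approach it.
centers = np.linspace(-1.0, 1.0, 20)
grads = [lambda x, c=c: x - c for c in centers]
x_final = local_sgd(grads, x0=[0.5], n_nodes=20, R=100, K=5, eta=0.1)
assert abs(x_final[0] - centers.mean()) < 1e-3
```

The paper's experiments instantiate this template with ResNet18 on the sharded CIFAR-10 partitions and the hyperparameters quoted in the row above (e.g., batch size 8, η = 0.1, K ∈ {1, 2, 5, 10, 20}).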