Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

Authors: Hiroki Naganuma, Xinzhi Zhang, Man-Chung Yue, Ioannis Mitliagkas, Russell J. Hewett, Philipp Andre Witte, Yin Tat Lee

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated PALSGD on image classification and language modeling tasks. Our results show that PALSGD reaches target performance in less time than existing methods such as Distributed Data Parallel (DDP) and DiLoCo. Notably, PALSGD trains 18.4% faster than DDP on ImageNet-1K with ResNet-50, 24.4% faster than DDP on TinyStories with GPT-Neo-125M, and 21.1% faster than DDP on TinyStories with GPT-Neo-8M. Our contributions are as follows: Empirical Validation: We demonstrate the effectiveness of PALSGD through experiments on the ImageNet-1K (Deng et al., 2009), TinyStories (Eldan & Li, 2023), and CIFAR-10 datasets. We show that it achieves superior training efficiency compared to existing methods like Distributed Data Parallel (DDP) and DiLoCo (Douillard et al., 2023) (Section 7, Figures 2, 3, and 4).
Researcher Affiliation | Collaboration | Hiroki Naganuma1,2, Xinzhi Zhang3, Man-Chung Yue4, Ioannis Mitliagkas1,2,5, Philipp A. Witte6, Russell J. Hewett7, Yin Tat Lee3,6. Affiliations: 1Mila, 2Université de Montréal, 3University of Washington, 4The University of Hong Kong, 5Canada CIFAR AI Chair, 6Microsoft, 7NVIDIA.
Pseudocode | Yes | Algorithm 1: Pseudo-Asynchronous Local SGD with Decoupled Optimizers
Data: x^(0) (initial model), K > 0 (number of workers), p ∈ (0, 1) (probability of mixing step), η_t > 0 (mixing rate), H > 0 (sync interval), optimizers InnerOPT and OuterOPT, α_t (learning rate for InnerOPT)
for worker k = 1, ..., K do
    x_k^(0) ← x^(0)
    for t = 0, ..., T − 1 do
        b ← U[0, 1]
        if b ≤ p then
            x_k^(t+1) ← x_k^(t) − (α_t η_t / p) (x_k^(t) − x^(t))        // pseudo-synchronization step
        else
            Sample data ξ ∼ D_k
            g_k^(t) ← ∇f(x_k^(t), ξ)
            x_k^(t+1) ← InnerOPT(x_k^(t), g_k^(t), α_t / (1 − p))        // gradient step
        end
        if (t + 1) mod H = 0 then
            Δ^(t) ← All-Reduce(x^(t−1) − x_k^(t))                        // aggregate outer gradient
            x^(t+1) ← OuterOPT(x^(t), Δ^(t))                             // update global model
        else
            x^(t+1) ← x^(t)
        end
    end
end
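To make the control flow of Algorithm 1 concrete, here is a minimal single-process Python sketch that simulates K workers on a toy one-dimensional quadratic. It is an illustration under stated assumptions, not the authors' implementation: plain SGD stands in for both InnerOPT and OuterOPT (the paper uses momentum-based optimizers for both roles), and all function and variable names are made up for this example.

```python
import random

def palsgd_toy(x0, K, T, H, p, alpha, eta, outer_lr, grad):
    """Single-process sketch of PALSGD (Algorithm 1) on a scalar model.

    Plain SGD stands in for both InnerOPT and OuterOPT; the paper uses
    momentum-based optimizers instead.
    """
    x_global = x0
    x = [x0] * K                      # each worker starts from the global model
    for t in range(T):
        for k in range(K):
            if random.random() <= p:
                # pseudo-synchronization step: pull the worker toward the
                # (possibly stale) global model, scaled by alpha*eta/p
                x[k] -= (alpha * eta / p) * (x[k] - x_global)
            else:
                # gradient step, with the learning rate rescaled by 1/(1 - p)
                x[k] -= (alpha / (1.0 - p)) * grad(x[k])
        if (t + 1) % H == 0:
            # all-reduce: average departure of the workers from the global model
            delta = sum(x_global - xk for xk in x) / K
            # outer update: treat delta as a gradient for the global model
            x_global -= outer_lr * delta
    return x_global

random.seed(0)
# Minimize f(x) = x^2 / 2, so grad(x) = x; the iterate should shrink toward 0.
result = palsgd_toy(x0=1.0, K=4, T=200, H=16, p=0.25,
                    alpha=0.1, eta=1.0, outer_lr=1.0, grad=lambda x: x)
```

With outer_lr = 1 and plain SGD as the outer optimizer, the outer update reduces to replacing the global model with the worker average, which matches the intuition that the outer gradient Δ measures how far the workers have drifted since the last synchronization.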
Open Source Code | Yes | Our code is available at https://github.com/Hiroki11x/Pseudo-Asynchronous-LocalSGD.
Open Datasets | Yes | We demonstrate the effectiveness of PALSGD through experiments on the ImageNet-1K (Deng et al., 2009), TinyStories (Eldan & Li, 2023), and CIFAR-10 datasets. We show that it achieves superior training efficiency compared to existing methods like Distributed Data Parallel (DDP) and DiLoCo (Douillard et al., 2023) (Section 7, Figures 2, 3, and 4).
Dataset Splits | Yes | The CIFAR-10 dataset is widely used in machine learning research, particularly for image recognition tasks. It contains 60,000 color images, each measuring 32 × 32 pixels, evenly distributed across ten distinct classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The dataset consists of 50,000 training images and 10,000 test images, with each class represented by 6,000 images.
Hardware Specification | Yes | Hardware Configurations: We utilized three types of clusters for our experiments:
Cluster A: 4× NVIDIA Tesla T4 (16 GB); GPU bandwidth: 320.0 GB/s; GPU interconnect: PCIe with NUMA node interconnect (no NVLink)
Cluster B: 8× NVIDIA Tesla V100 DGXS (32 GB); GPU bandwidth: 897.0 GB/s; GPU interconnect: NVLink, 150 GB/s per GPU
Cluster C: 4× NVIDIA L40S (48 GB); GPU bandwidth: 864.0 GB/s; GPU interconnect: PCIe with NUMA node interconnect (no NVLink)
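The interconnect figures above drive the per-step communication cost that local-SGD-style methods amortize. As a back-of-the-envelope illustration (not from the paper), the ring all-reduce used by typical DDP implementations moves roughly 2(N−1)/N of the model per GPU, so a crude lower bound on one gradient exchange can be sketched as follows; the 16 GB/s PCIe-class figure is an assumed example value, not a number from the cluster specs.

```python
def ring_allreduce_seconds(model_mbytes, link_gb_per_s, num_gpus):
    """Rough lower bound on ring all-reduce time for one gradient exchange.

    Each GPU sends and receives about 2*(N-1)/N of the model over its
    slowest link; real timings add latency and framework overhead.
    """
    traffic_gb = (model_mbytes / 1024.0) * 2.0 * (num_gpus - 1) / num_gpus
    return traffic_gb / link_gb_per_s

# Example: ~100 MB of fp32 gradients (ResNet-50 scale) over Cluster B's
# 150 GB/s NVLink links, vs. an assumed 16 GB/s PCIe-class link.
t_nvlink = ring_allreduce_seconds(100, 150.0, 8)
t_pcie = ring_allreduce_seconds(100, 16.0, 8)
```

Under this simple model, stretching the synchronization interval H directly divides the communication term by H, which is the efficiency lever PALSGD and DiLoCo exploit.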
Software Dependencies | Yes | Software and Library Configurations: All GPU clusters used the following software environment: Python 3.11.6, PyTorch 2.3.1+cu121, CUDA 12.1, cuDNN 8902.
Experiment Setup | Yes | Training Configuration: All experimental results, unless otherwise noted, refer to the hyperparameter configuration with the best results for that metric. The results of the ablation study are shown in Section E. The target loss shown in the training curve plots is based on the loss achieved by DDP within the predefined epoch budget. Small CNN on CIFAR-10 (Preliminary Simulation Experiments): The number of workers ranged from 4 to 64, and the synchronization interval tested was 32. PALSGD was run with p = 0.25. The outer optimizer was Nesterov momentum and the inner optimizer was momentum SGD. Inner learning rates of 0.01, 0.001, and 0.0001 were explored. Additionally, η was fixed at 1, and the outer learning rate was selected from {1, 0.1, 0.01}. No weight decay was applied, and training was conducted for 100 epochs. CIFAR-10 Training on VGG-16: We trained the model for 200 epochs using 4 GPUs on a single node of Cluster C. The local batch size was set to 128 per GPU, resulting in a global batch size of 512. No gradient accumulation was used. The model architecture was selected from the torchvision model zoo, with VGG-16 as the default. We used momentum SGD as the inner optimizer with learning rates selected from {0.025, 0.05, 0.075}, and Nesterov momentum SGD as the outer optimizer. The outer learning rate was chosen from {0.2, 0.4, 0.8, 1.2, 1.6}, with a warm-up period of 10 epochs followed by cosine annealing (CosineAnnealingLR) for scheduling. Weight decay was fixed at 1e-4. The synchronization interval H was set to 128, and Local SGD variants started after 2049 iterations (approximately at epoch 40). We used the PALSGD algorithm with decoupled updates enabled. For PALSGD-style adaptive synchronization, we explored probabilistic synchronization with p ∈ {0.02, 0.05, 0.1} and local step size η ∈ {0.1, 0.25, 0.5}. ImageNet-1K Training on ResNet-50: We trained the model for 90 epochs with a local batch size of 64 per GPU, using 4 GPUs on Cluster C.
The global batch size was set to 256. The inner optimizer's learning rate was fixed at 0.001 with momentum SGD. For the outer optimizer, we used Nesterov momentum SGD with the outer learning rate fixed between 0.1 and 0.2, as in the GPT-Neo experiments. The synchronization interval H was set to 64. The Local SGD variants started after 200K iterations (epoch 39). For the PALSGD experiments, the probabilistic synchronization parameter was p = 0.05 and η_t ranged from 0.5 to 16. TinyStories Training on GPT-Neo-8M and 125M: Both the 8M and 125M models were trained following the protocol described below. We trained the model for 15 epochs with a local batch size of 512 per GPU, using Cluster A for the 4-GPU experiments and Cluster B for the 8-GPU experiments. The global batch size was set to 2048. The inner learning rate was fixed at 0.001, and we employed AdamW (Loshchilov, 2017) as the inner optimizer with gradient clipping enabled. For the outer optimization in DiLoCo and PALSGD, we used Nesterov momentum SGD with the outer learning rate fixed between 0.1 and 0.2. The synchronization interval H was set to 16 (125M) or 64 (8M), with probabilistic synchronization parameter p = 0.1 and η_t = 16. The Local SGD variants started after 1024 iterations. For the ablation study, H ranged from 32 to 256, p from 0.025 to 0.5, and η_t from 0.25 to 64.
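For quick reference, the TinyStories settings above can be collected into a single configuration sketch. The key names below are hypothetical (they do not come from the paper's code); the values restate the numbers reported in the text.

```python
# Illustrative summary of the reported TinyStories GPT-Neo training setup.
# Key names are made up for this sketch; values restate the text above.
TINYSTORIES_SETUP = {
    "epochs": 15,
    "local_batch_size": 512,          # per GPU
    "global_batch_size": 2048,
    "inner_optimizer": "AdamW",       # with gradient clipping enabled
    "inner_lr": 1e-3,
    "outer_optimizer": "Nesterov momentum SGD",
    "outer_lr_range": (0.1, 0.2),
    "sync_interval_H": {"gpt_neo_125m": 16, "gpt_neo_8m": 64},
    "sync_probability_p": 0.1,
    "mixing_rate_eta": 16,
    "local_sgd_start_iteration": 1024,
}
```

Note that the global batch size (2048) equals the per-GPU batch size (512) times the 4 GPUs of Cluster A; on Cluster B's 8 GPUs the same global batch would halve the per-GPU share or require gradient accumulation, which the text does not specify.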