AdaGK-SGD: Adaptive Global Knowledge Guided Distributed Stochastic Gradient Descent

Authors: Hangyu Ye, Weiying Xie, Yunsong Li, Leyuan Fang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerically, we find that AdaGK-SGD can significantly improve the accuracy and generalizability of distributed algorithms compared with existing methods. ... Experiments: In this section, we evaluate AdaGK-SGD and the improved versions with the MLGK module of SlowMo (Wang et al. 2019), EASGD (Zhang, Choromanska, and LeCun 2015), and BMUF-Adam (Chen, Ding, and Huo 2020) on a variety of image classification models and datasets. ... Performance on CIFAR-10/100. ... Performance on ILSVRC2012. ... Parametric Analysis
Researcher Affiliation | Academia | 1. State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China; 2. College of Electrical and Information Engineering, Hunan University, Changsha 410082, China
Pseudocode | Yes |
Algorithm 1: AdaGK-SGD
Require: network scale N, global averaging period τ, total number of iterations T, learning rate η, auxiliary variable parameter α, initial parameter w_init.
Initialize: w^(0) ← w_init, w_Global^(0) ← w_init, Z^(0) ← 0, G^(0) ← 0, M ← τ.
1: for k = 1, 2, ..., T, every worker i do
2:   Sample ξ_i^(k+1), update g_i^(k) = ∇L_i(ξ_i^(k+1), w_i^(k)).
3:   µ(k) = max{s : s ≤ k and s mod τ = 0}.
4:   if k = µ(k) then
5:     w_Global^(µ(k)) = (1/N) Σ_{i=1}^N w_i^(µ(k)).
6:     Compute Z_i^(µ(k)) based on Equation 13 or other methods.
7:   end if
8:   w_i^(k+1/2) = w_i^(k) − η g_i^(k)
9:   Determine ψ based on Equation 14.
10:  G_i^(k) = ψα (Z_i^(µ(k)) − w_i^(k+1/2))
11:  Compute M of global knowledge based on Equation 19 or 22.
12:  w_i^(k+1) = w_i^(k+1/2) + G_i^(k)
13: end for
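To make the control flow of Algorithm 1 concrete, here is a minimal single-process, scalar-parameter sketch. It is not the paper's implementation: Z_i is simplified to the periodic global parameter average (one option permitted by line 6 of the algorithm), ψ is held as a fixed scalar rather than determined adaptively via Equation 14, and the adaptive period M of Equations 19/22 is replaced by the constant τ.

```python
def adagk_sgd(grad_fn, w_init, N=4, T=200, tau=10, eta=0.05, alpha=0.5, psi=0.1):
    """Simplified sketch of AdaGK-SGD (Algorithm 1) for a scalar parameter.

    grad_fn: local stochastic gradient oracle (shared by all workers here).
    Simplifications (not from the paper): Z_i is the global average,
    psi is a constant, and the period M is fixed at tau.
    """
    workers = [w_init] * N          # local parameters w_i
    Z = [w_init] * N                # global knowledge, refreshed every tau steps
    for k in range(1, T + 1):
        if k % tau == 0:            # lines 4-7: periodic global averaging
            w_global = sum(workers) / N
            Z = [w_global] * N
        new_workers = []
        for i in range(N):
            g = grad_fn(workers[i])                 # line 2: local gradient
            w_half = workers[i] - eta * g           # line 8: local SGD step
            G = psi * alpha * (Z[i] - w_half)       # line 10: global-knowledge pull
            new_workers.append(w_half + G)          # line 12: corrected update
        workers = new_workers
    return sum(workers) / N
```

Running it on a toy quadratic, e.g. `adagk_sgd(lambda w: 2 * (w - 3.0), 0.0)`, drives all workers toward the minimizer w = 3 while the periodic pull toward Z keeps them synchronized between averaging rounds.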
Open Source Code | Yes | Code: https://github.com/Yehangyu-XD/AdaGK-SGD
Open Datasets | Yes | The datasets we use for our experiments are CIFAR-10/100 and ILSVRC2012.
Dataset Splits | No | The datasets we use for our experiments are CIFAR-10/100 and ILSVRC2012. ... The TOP-1 test accuracy of AdaGK-SGD and the improved algorithms compared with that of the baseline and the original versions of SlowMo, EASGD, and BMUF-Adam. ... When it is not necessary to specify the parameters, the number of epochs is set to 100 and the local batch size is set to 256. The paper mentions "test accuracy" but does not specify how the datasets were split into training, validation, and test sets, nor does it explicitly state that standard splits were used with a reference.
Hardware Specification | Yes | All experiments on the CIFAR datasets are performed on 4 NVIDIA GTX 3090 GPUs. All experiments on the ILSVRC2012 dataset are performed on 4 NVIDIA A100-SXM GPUs.
Software Dependencies | No | To ensure the reliability and validity of the experiments, the models are implemented with PyTorch. The paper mentions PyTorch but does not provide version numbers for it or any other key software dependencies.
Experiment Setup | Yes | When it is not necessary to specify the parameters, the number of epochs is set to 100 and the local batch size is set to 256. All experiments use the warm-up algorithm (Goyal et al. 2017) to improve convergence: the learning rate is linearly increased to 0.01 over the first 5 epochs and then decays to 10^-6 following a cosine schedule.
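The described schedule can be sketched as a per-epoch rule. This is a hedged reading: the paper's quoted setup does not state whether the cosine decay is stepped per epoch or per iteration, so the function below assumes per-epoch updates.

```python
import math

def lr_schedule(epoch, total_epochs=100, warmup=5, peak=0.01, floor=1e-6):
    """Warm-up + cosine decay as described in the experiment setup.

    Epochs 0..warmup-1: linear ramp up to `peak` (0.01).
    Remaining epochs: cosine decay from `peak` down toward `floor` (1e-6).
    """
    if epoch < warmup:
        # Linear warm-up: reaches `peak` at the end of the warm-up phase.
        return peak * (epoch + 1) / warmup
    # Cosine decay over the post-warm-up fraction of training.
    progress = (epoch - warmup) / (total_epochs - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

This mirrors what PyTorch's `CosineAnnealingLR` computes after a manual linear warm-up phase, with `eta_min` playing the role of the 10^-6 floor.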