AdaGK-SGD: Adaptive Global Knowledge Guided Distributed Stochastic Gradient Descent
Authors: Hangyu Ye, Weiying Xie, Yunsong Li, Leyuan Fang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerically, we find that AdaGK-SGD can significantly improve the accuracy and generalizability of distributed algorithms compared with existing methods. ... Experiments In this section, we evaluate AdaGK-SGD and the improved version with MLGK module of SlowMo (Wang et al. 2019), EASGD (Zhang, Choromanska, and LeCun 2015), and BMUF-Adam (Chen, Ding, and Huo 2020) on a variety of different image classification models and datasets. ... Performance on CIFAR-10/100. ... Performance on ILSVRC2012. ... Parametric Analysis |
| Researcher Affiliation | Academia | 1 State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China 2 College of Electrical and Information Engineering, Hunan University, Changsha 410082, China EMAIL, EMAIL, EMAIL, leyuan EMAIL |
| Pseudocode | Yes | Algorithm 1: AdaGK-SGD. Require: network scale N, global averaging period τ, total number of iterations T, learning rate η, auxiliary variable parameter α, initial parameter w_init. Initialize: w^(0) = w_init, w_Global^(0) = w_init, Z^(0) = 0, G^(0) = 0, M = τ. 1: for k = 1, 2, ..., T, every worker i do; 2: sample ξ_i^(k+1), update g_i^(k) = ∇L_i(ξ_i^(k+1), w_i^(k)); 3: µ(k) = max{s : s ≤ k and s mod τ = 0}; 4: if k = µ(k) then; 5: w_Global^(µ(k)) = (1/N) Σ_{i=1}^N w_i^(µ(k)); 6: compute Z_i^(µ(k)) based on Equation 13 or other methods; 7: end if; 8: w_i^(k+1/2) = w_i^(k) − η g_i^(k); 9: determine ψ based on Equation 14; 10: G_i^(k) = ψα (Z_i^(µ(k)) − w_i^(k+1/2)); 11: compute M of global knowledge based on Equation 19 or 22; 12: w_i^(k+1) = w_i^(k+1/2) + G_i^(k); 13: end for |
| Open Source Code | Yes | Code: https://github.com/Yehangyu-XD/AdaGK-SGD |
| Open Datasets | Yes | The datasets we use for our experiments are CIFAR10/100 and ILSVRC2012. |
| Dataset Splits | No | The datasets we use for our experiments are CIFAR10/100 and ILSVRC2012. ... The TOP-1 test accuracy of AdaGK-SGD and improved algorithms compared with that of the baseline and the original version of SlowMo, EASGD, and BMUF-Adam. ... When it is not necessary to specify the parameters, the epoch is set to 100, and the local batch size is set to 256. The paper mentions "test accuracy" but does not specify how the datasets were split into training, validation, and test sets, nor does it explicitly state that standard splits were used with reference. |
| Hardware Specification | Yes | All experiments on the dataset CIFAR are performed on 4 NVIDIA GTX 3090 GPUs. All experiments on the dataset ILSVRC2012 are performed on 4 NVIDIA A100-SXM. |
| Software Dependencies | No | To ensure the reliability and validity of the experiments, the models in training are implemented with PyTorch. The paper mentions PyTorch but does not provide a version number for it or for any other key software dependency. |
| Experiment Setup | Yes | When it is not necessary to specify the parameters, the epoch count is set to 100 and the local batch size is set to 256. All experiments use the warm-up (Goyal et al. 2017) algorithm to improve convergence: the learning rate is linearly increased to 0.01 over the first 5 epochs and then decays to 10^-6 following a cosine schedule. |
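The pseudocode in the table above can be sketched in runnable form. This is a minimal simulation, not the authors' implementation: the function name `adagk_sgd` is ours, workers are simulated sequentially rather than in parallel, and the paper's Equations 13, 14, 19, and 22 (computing Z, ψ, and M) are replaced with simple stand-ins (Z set to the global average, ψ fixed at 1). Only the control flow of Algorithm 1 is reproduced.

```python
import numpy as np

def adagk_sgd(grads_fn, w_init, n_workers=4, tau=5, T=20, eta=0.1, alpha=0.5):
    """Sketch of Algorithm 1 (AdaGK-SGD) with stand-in Z and psi.

    grads_fn(w, i) returns worker i's stochastic gradient at parameters w.
    """
    w = [w_init.copy() for _ in range(n_workers)]
    # Stand-in for Z^(0): initialized to the starting parameters.
    Z = [w_init.copy() for _ in range(n_workers)]
    for k in range(1, T + 1):
        # Lines 3-7: every tau iterations, average parameters globally
        # and refresh the global-knowledge variable Z (Eq. 13 stand-in).
        if k % tau == 0:
            w_global = np.mean(w, axis=0)
            Z = [w_global.copy() for _ in range(n_workers)]
        for i in range(n_workers):
            g = grads_fn(w[i], i)                # line 2: stochastic gradient
            w_half = w[i] - eta * g              # line 8: local SGD step
            psi = 1.0                            # line 9: Eq. 14 stand-in
            G = psi * alpha * (Z[i] - w_half)    # line 10: global-knowledge pull
            w[i] = w_half + G                    # line 12: guided update
    return np.mean(w, axis=0)
```

On a toy quadratic objective shared by all workers, the iterates are pulled toward the minimizer both by the local gradient step and by the periodic global-knowledge correction.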
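The learning-rate policy quoted in the Experiment Setup row (linear warm-up to 0.01 over the first 5 epochs, then cosine decay to 10^-6 over the remaining epochs) can be written as a small helper. The function name `lr_schedule` and its per-epoch granularity are our assumptions; the paper may apply the schedule per iteration.

```python
import math

def lr_schedule(epoch, total_epochs=100, warmup_epochs=5,
                peak_lr=0.01, final_lr=1e-6):
    """Linear warm-up to peak_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        # Linear ramp: reaches peak_lr at the last warm-up epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine anneal from peak_lr down to final_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

With the defaults matching the paper's stated settings, epoch 0 starts at 0.002, epoch 4 reaches the 0.01 peak, and the final epochs approach 10^-6.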