On Biased Compression for Distributed Learning
Authors: Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sections 6.1–6.4, we present our experiments, which are primarily focused on supporting our theoretical findings. Therefore, we simulate these experiments on one machine, which enables rapid direct comparisons against the prior methods. In more detail, we use a machine with 24 Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz cores and a GeForce GTX 1080 Ti GPU. Section 6.5 is devoted to real experiments with a large model and big data. For these experiments, we use a computational cluster with 10 Tesla T4 GPUs. We implement all methods in Python 3.7 using PyTorch (Paszke et al., 2019). |
| Researcher Affiliation | Academia | Aleksandr Beznosikov EMAIL Computer, Electrical and Math. Sciences and Engineering Division King Abdullah University of Science and Technology, 23955, Thuwal, KSA Skolkovo Institute of Science and Technology, 121205, Moscow, Russia School of Applied Mathematics and Informatics Moscow Institute of Physics and Technology, 141701, Moscow, Russia |
| Pseudocode | Yes | Algorithm 1 Distributed SGD with Biased Compression and Error Feedback |
| Open Source Code | No | The paper states: "We implement all methods in Python 3.7 using Pytorch Paszke et al. (2019)." However, it does not provide an explicit statement about releasing their own source code for the methodology described in the paper, nor does it include a link to a repository. |
| Open Datasets | Yes | Practical distribution. We obtained various gradient distributions via logistic regression (mushrooms LIBSVM dataset) and least squares. We run 2 sets of experiments with ResNet18 on the CIFAR10 dataset. Figure 4 displays training/test loss and accuracy for VGG19 on CIFAR10 with data equally distributed among 4 nodes. We train ALBERT-large (Lan et al., 2020) (18M parameters) with layer sharing on a combination of the Bookcorpus (Zhu et al., 2015) and Wikipedia (Devlin et al., 2018) datasets. For the second experiment shown in Figure 8, we run standard linear regression on two scikit-learn datasets, Boston and Diabetes, and apply data normalization as the preprocessing step. |
| Dataset Splits | Yes | Figure 4 displays training/test loss and accuracy for VGG19 on CIFAR10 with data equally distributed among 4 nodes. We train ALBERT-large (Lan et al., 2020) (18M parameters) with layer sharing on a combination of Bookcorpus (Zhu et al., 2015) and Wikipedia (Devlin et al., 2018) datasets. We measure how the training loss changes (Figure 9) as well as at the end of training we evaluate the final performance for each model on several popular tasks from (Wang et al., 2018) (Table 5). |
| Hardware Specification | Yes | We use a machine with 24 Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz cores and a GeForce GTX 1080 Ti GPU. Section 6.5 is devoted to real experiments with a large model and big data. For these experiments, we use a computational cluster with 10 Tesla T4 GPUs. |
| Software Dependencies | No | The paper states: "We implement all methods in Python 3.7 using Pytorch Paszke et al. (2019)." While Python 3.7 is mentioned with a version, a specific version for PyTorch is not provided. "Paszke et al. (2019)" refers to the paper introducing PyTorch, not a version number used in this work. |
| Experiment Setup | Yes | We use plain SGD with a default step size equal to 0.01 for all methods, i.e. Top-5 with and without error feedback, Rand-5 and no compression. We use 2 levels with infinity norm for natural dithering and k = 5 for sparsification methods. For all the compression operators, we train VGG11 on CIFAR10 with plain SGD as an optimizer and default step size equal to 0.01. We use the same optimizer (LAMB) and the same tuning for it as in the original paper (Lan et al., 2020). |