NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

Authors: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

JMLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental To study how quantization affects convergence on state-of-the-art deep models, we compare NUQSGD, QSGD, QSGDinf, and EF-SignSGD, focusing on training loss, variance, and test accuracy on standard deep models and large datasets. Using the same number of bits per iteration, experimental results show that NUQSGD has smaller variance than QSGD, as predicted. The smaller variance also translates to improved optimization performance, in terms of both training loss and test accuracy. We also observe that NUQSGD matches the performance of QSGDinf in terms of variance and loss/accuracy. Further, our distributed implementation shows that the resulting algorithm considerably reduces the communication cost of distributed training, without adversely impacting accuracy. Our empirical results show that NUQSGD can provide faster end-to-end parallel training relative to data-parallel SGD, QSGD, and EF-SignSGD on the ImageNet dataset (Deng et al., 2009), in particular when combined with non-trivial coding of the quantized gradients.
Researcher Affiliation Academia Ali Ramezani-Kebrya (École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland); Fartash Faghri (Dept. of Computer Science, Univ. of Toronto and Vector Institute, Toronto, ON M5T 3A1, Canada); Ilya Markov (Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria); Vitalii Aksenov (Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria); Dan Alistarh (Institute of Science and Technology Austria, 3400 Klosterneuburg, Austria); Daniel M. Roy (Dept. of Statistical Sciences, Univ. of Toronto and Vector Institute, Toronto, ON M5S 3G3, Canada)
Pseudocode Yes Algorithm 1: Data-parallel (synchronized) SGD. Algorithm 2: Elias recursive coding produces a bit string encoding of positive integers. Algorithm 3: ECD-PSGD with NUQSGD.
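The paper's Algorithm 2 is only named here, not reproduced. As a point of reference, "Elias recursive coding" of positive integers is commonly identified with the Elias omega code; the sketch below implements that standard scheme (an assumption about which Elias variant the paper uses, not the paper's own pseudocode):

```python
def elias_omega_encode(n: int) -> str:
    """Elias omega (recursive) code for a positive integer n.

    The code is built back-to-front: start with a terminating "0",
    then repeatedly prepend the binary form of n and recurse on
    (bit-length of n) - 1 until n == 1.
    """
    assert n >= 1, "Elias coding is defined for positive integers only"
    code = "0"  # terminating zero
    while n > 1:
        b = bin(n)[2:]   # binary representation, MSB first
        code = b + code  # prepend this group
        n = len(b) - 1   # recurse on the length of the group minus one
    return code


# Small integers map to short, prefix-free bit strings:
# 1 -> "0", 2 -> "100", 3 -> "110", 4 -> "101000"
```

Such a prefix-free code lets the sender concatenate quantization levels of varying magnitude into one bit stream that the receiver can decode unambiguously, which is what makes the "non-trivial coding of the quantized gradients" mentioned above pay off.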
Open Source Code No Doing so efficiently requires non-trivial refactoring of this framework, since it does not support communication compression; our framework will be open-sourced upon publication.
Open Datasets Yes We evaluate these methods on two image classification datasets: ImageNet and CIFAR10. We train ResNet110 on CIFAR10 and ResNet18 on ImageNet with mini-batch size 128 and base learning rate 0.1.
Dataset Splits Yes For CIFAR10, the hold-out set is the test set, and for ImageNet, the hold-out set is the validation set.
Hardware Specification Yes We show the execution time per epoch for ResNet34 and ResNet50 models on ImageNet, on a cluster machine with 8 NVIDIA 2080 Ti GPUs, for the hyper-parameter values quoted above. We use the following hardware: CPU information: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz, 24 cores. Batch size is 256, and results can be found in Figure 18. Under these hyper-parameter values the EF-SignSGD algorithm sends 128 bits per each bucket of 64 values (32 for each scaling factor, and 64 for the signs), doubling its baseline communication cost. Moreover, the GPU implementation is not as efficient, as error feedback must be computed and updated at every step, and there is less parallelism to leverage inside each bucket. This explains the fact that the end-to-end performance is in fact close to that of the 8-bit NUQSGD variant, and inferior to 4-bit NUQSGD. Comparison under Small Mini-batch Size: In Figures 19, 20, and 21, we show the results when we train ResNet110 on CIFAR10 with mini-batch size 32 over 8 GPUs.
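The per-bucket bit accounting quoted above can be checked in a few lines. The two 32-bit scaling factors are inferred from the phrase "32 for each scaling factor"; the sign-only 1-bit-per-value baseline is an assumption used to interpret "doubling its baseline communication cost":

```python
BUCKET = 64                 # gradient values per bucket
sign_bits = BUCKET * 1      # one sign bit per value
scale_bits = 2 * 32         # two 32-bit scaling factors per bucket (inferred)

ef_signsgd_bits = sign_bits + scale_bits  # total bits EF-SignSGD sends per bucket
sign_only_bits = BUCKET * 1               # 1-bit-per-value sign-only baseline
fp32_bits = BUCKET * 32                   # uncompressed fp32 gradients per bucket

# 128 bits per bucket: exactly double the sign-only baseline,
# but still a 16x reduction relative to full-precision fp32.
print(ef_signsgd_bits, ef_signsgd_bits / sign_only_bits, fp32_bits / ef_signsgd_bits)
# prints: 128 2.0 16.0
```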
Software Dependencies No For this, we implement and test these three methods (NUQSGD, QSGD, and QSGDinf), together with the distributed full-precision SGD baseline, which we call SuperSGD. Additionally, we will compare practical performance against a variant of SignSGD with EF (Karimireddy et al., 2019). We split our study across two axes: first, we validate our theoretical analysis by examining the variance induced by the methods, as well as their convergence in terms of loss/accuracy versus number of samples processed. Second, we provide an efficient implementation of all four methods in PyTorch using the Horovod communication back-end (Sergeev and Del Balso, 2018), a communication back-end supporting PyTorch, TensorFlow, and MXNet. We adapted Horovod to efficiently support quantization and gradient coding, and examine speedup relative to the full-precision baseline. Further, we examine the effect of quantization on training performance by measuring loss, variance, accuracy, and speedup for ResNet models (He et al., 2016) applied to ImageNet and CIFAR10 (Krizhevsky, 2009). Convergence and Variance: Our first round of experiments examines the impact of quantization on solution quality. We evaluate these methods on two image classification datasets: ImageNet and CIFAR10. We train ResNet110 on CIFAR10 and ResNet18 on ImageNet with mini-batch size 128 and base learning rate 0.1. In all experiments, momentum and weight decay are set to 0.9 and 10^-4, respectively. The bucket size (quantization granularity) and the number of quantization bits are set to 8192 and 4, respectively. We observed similar trends in experiments with various bucket sizes and numbers of bits per entry. We simulate a scenario with k GPUs for all three quantization methods by estimating the gradient from k independent mini-batches and aggregating them after quantization and dequantization.
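The k-GPU simulation described above (quantize, dequantize, then aggregate k independent mini-batch gradients) can be sketched with a generic QSGD-style stochastic uniform quantizer. This is a minimal sketch under stated assumptions: it uses uniformly spaced levels and per-bucket L2 normalization, whereas the paper's NUQSGD places its quantization levels nonuniformly; bit widths and bucket sizes mirror the experiments (4 bits, bucket 8192):

```python
import numpy as np


def quantize_dequantize(v, bits=4, bucket=8192, rng=np.random):
    """Unbiased stochastic uniform quantization, applied per bucket.

    Generic QSGD-style sketch, NOT the paper's nonuniform level placement:
    each bucket is normalized by its L2 norm, scaled to `2**bits - 1`
    levels, and rounded up or down at random so the quantizer stays
    unbiased in expectation.
    """
    levels = 2 ** bits - 1
    out = np.empty_like(v, dtype=float)
    for start in range(0, len(v), bucket):
        chunk = v[start:start + bucket]
        norm = np.linalg.norm(chunk)
        if norm == 0:
            out[start:start + bucket] = 0.0
            continue
        scaled = np.abs(chunk) / norm * levels
        lower = np.floor(scaled)
        # randomized rounding: round up with probability (scaled - lower)
        q = lower + (rng.random(chunk.shape) < (scaled - lower))
        out[start:start + bucket] = np.sign(chunk) * q * norm / levels
    return out


def simulate_aggregation(grads, **kw):
    """Average k workers' gradients after quantize/dequantize,
    mimicking the k-GPU simulation described in the text."""
    return np.mean([quantize_dequantize(g, **kw) for g in grads], axis=0)
```

In the real distributed setting each worker would encode its quantized bucket (e.g. with the Elias coding of Algorithm 2) before communication; here quantization and dequantization happen back-to-back on one machine, which is exactly what makes the variance of the quantizer observable in isolation.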
Experiment Setup Yes We train ResNet110 on CIFAR10 and ResNet18 on ImageNet with mini-batch size 128 and base learning rate 0.1. In all experiments, momentum and weight decay are set to 0.9 and 10^-4, respectively. The bucket size (quantization granularity) and the number of quantization bits are set to 8192 and 4, respectively.