ErrorCompensatedX: error compensation for variance reduced algorithms
Authors: Hanlin Tang, Yao Li, Ji Liu, Ming Yan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we train ResNet-50 (He et al., 2016) on CIFAR10, which consists of 50000 training images and 10000 testing images across 10 classes. We run the experiments on eight workers, each having a 1080Ti GPU. The batch size on each worker is 16 and the total batch size is 128. ... Figure 2: Epoch-wise convergence comparison on ResNet-50 for Momentum SGD (left column), STORM (middle column), and IGT (right column) with different communication implementations. |
| Researcher Affiliation | Collaboration | Hanlin Tang, Department of Computer Science, University of Rochester; Yao Li, Department of Mathematics, Michigan State University; Ji Liu, Kuaishou Technology; Ming Yan, Department of Computational Mathematics, Science and Technology, and Department of Mathematics, Michigan State University |
| Pseudocode | Yes | Algorithm 1 ErrorCompensatedX for general A(x; ξ) |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | In this section, we train ResNet-50 (He et al., 2016) on CIFAR10, which consists of 50000 training images and 10000 testing images across 10 classes. |
| Dataset Splits | No | The paper states '50000 training images and 10000 testing images' for CIFAR-10 but does not specify a validation set split. |
| Hardware Specification | Yes | We run the experiments on eight workers, each having a 1080Ti GPU. |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | The batch size on each worker is 16 and the total batch size is 128. ... We use the 1-bit compression in Tang et al. (2019), which leads to an overall 96% of communication volume reduction. ... We grid search the best learning rate from {0.5, 0.1, 0.001} and c0 from {0.1, 0.05, 0.001}, and find that the best learning rate is 0.01 with c0 = 0.05 for both original STORM and IGT. ... We set β = 0.3 for the low-pass filter in all cases. |
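The setup row above pairs error compensation with the 1-bit compressor of Tang et al. (2019). The following is a minimal sketch of one error-compensated compressed update, assuming a scaled-sign compressor; the function names and the exact compressor form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def one_bit_compress(v):
    """Scaled-sign 1-bit compressor (sketch): keep only the sign of
    each entry, scaled by the mean magnitude so the compressed vector
    preserves the average entry size. Illustrative, not the paper's
    exact operator."""
    scale = np.mean(np.abs(v))
    return scale * np.sign(v)

def error_compensated_step(grad, error, lr):
    """One error-compensated communication step (sketch).

    `error` is the compression residual kept locally from the
    previous round; it is added back before compressing, so the
    error made in each round is corrected in later rounds.
    Returns the model update to apply and the new residual.
    """
    corrected = grad + error              # re-inject last round's residual
    compressed = one_bit_compress(corrected)
    new_error = corrected - compressed    # residual carried to next round
    return -lr * compressed, new_error
```

By construction, the residual plus the transmitted vector always equals the corrected gradient, so no information is permanently lost to compression; it is merely delayed, which is what makes the 96% communication-volume reduction quoted above compatible with convergence.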