Staleness-Aware Async-SGD for Distributed Deep Learning
Authors: Wei Zhang, Suyog Gupta, Xiangru Lian, Ji Liu
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental verification is performed on commonly used image classification benchmarks, CIFAR10 and ImageNet, to demonstrate the effectiveness of the proposed approach compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm. |
| Researcher Affiliation | Collaboration | Wei Zhang, Suyog Gupta (IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA); Xiangru Lian, Ji Liu (Department of Computer Science, University of Rochester, NY 14627, USA) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | No explicit statement about the release of source code or a link to a code repository was found. |
| Open Datasets | Yes | We present results on two datasets: CIFAR10 [Krizhevsky and Hinton, 2009] and Image Net [Russakovsky et al., 2015]. |
| Dataset Splits | Yes | The CIFAR10 [Krizhevsky and Hinton, 2009] dataset comprises a total of 60,000 RGB images of size 32×32 pixels, partitioned into the training set (50,000 images) and the test set (10,000 images). ... The training set is a subset of the ImageNet database and contains 1.2 million 256×256 pixel images. The validation dataset has 50,000 images. |
| Hardware Specification | Yes | We deploy our implementation on a P775 supercomputer. Each node of this system contains four eight-core 3.84 GHz IBM POWER7 processors, one optical connect controller chip and 128 GB of memory. A single node has a theoretical floating point peak performance of 982 Gflop/s, memory bandwidth of 512 GB/s and bi-directional interconnect bandwidth of 192 GB/s. |
| Software Dependencies | No | The paper mentions using 'MPI' and the 'open-source Caffe deep learning package' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | When using a single learner, the mini-batch size is set to 128, and training for 140 epochs using momentum-accelerated SGD (momentum = 0.9) results in a model... The base learning rate is set to 0.001 and reduced by a factor of 10 after the 120th and 130th epoch. In order to achieve comparable model accuracy to the single learner, we follow the prescription of [Gupta et al., 2015] and reduce the mini-batch size per learner as more learners are added to the system, in order to keep the product of mini-batch size and number of learners approximately invariant. ... With a single learner, training with a mini-batch size of 256 and momentum 0.9 results in a top-1 error of 42.56% and a top-5 error of 19.18% on the validation set at the end of 35 epochs. The initial learning rate is set to 0.01 and reduced by a factor of 5 after the 20th and again after the 30th epoch. |
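The Experiment Setup row describes two mechanical prescriptions: keeping the product of per-learner mini-batch size and learner count approximately invariant, and a step-wise learning-rate schedule. A minimal sketch of both, with function names and the staleness-modulated variant chosen for illustration (the paper's title suggests the learning rate is additionally divided by a gradient's staleness, but the exact rule is not quoted in this table, so treat `staleness_aware_lr` as an assumption):

```python
def per_learner_batch_size(base_batch, n_learners):
    """Keep (mini-batch size x number of learners) approximately invariant,
    as in the prescription of [Gupta et al., 2015] quoted above."""
    return max(1, base_batch // n_learners)

def stepwise_lr(epoch, base_lr, drop_epochs, factor):
    """Step schedule: reduce the base learning rate by `factor`
    after each epoch listed in `drop_epochs` (CIFAR10 setup:
    base_lr=0.001, drops at epochs 120 and 130, factor 10)."""
    lr = base_lr
    for d in drop_epochs:
        if epoch >= d:
            lr /= factor
    return lr

def staleness_aware_lr(lr, staleness):
    """Hypothetical staleness modulation: scale the step size down
    by the staleness of the applied gradient (>= 1 step old)."""
    return lr / max(1, staleness)
```

For example, with 4 learners the per-learner batch becomes 128 // 4 = 32, and the CIFAR10 schedule yields 0.001 before epoch 120, 0.0001 from epoch 120, and 0.00001 from epoch 130.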