Measuring the Effects of Data Parallelism on Neural Network Training

Authors: Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.
Researcher Affiliation | Industry | Google Brain, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
Pseudocode | No | The paper describes algorithms such as SGD, SGD with momentum, and Nesterov momentum using mathematical equations and update rules in Section 2.2 (Algorithms), but it does not present any structured pseudocode blocks or algorithms labeled as such.
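Since the paper presents these optimizers only as update rules, a minimal NumPy sketch of the Nesterov momentum update may help make the row concrete. This follows the common Sutskever et al. (2013) formulation; the function name and the toy quadratic loss are illustrative assumptions, not the authors' code, and the paper's exact notation may differ.

```python
import numpy as np

def nesterov_momentum_step(theta, v, grad, lr=0.1, momentum=0.9):
    """One Nesterov momentum update (common Sutskever et al. formulation):
    accumulate the velocity, then take a lookahead step with it."""
    v_new = momentum * v + grad                          # velocity update
    theta_new = theta - lr * (momentum * v_new + grad)   # lookahead parameter step
    return theta_new, v_new

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = nesterov_momentum_step(theta, v, grad=theta)
# theta has contracted toward the minimum at the origin
```

Plain momentum is recovered by dropping the extra `momentum * v_new` lookahead term in the parameter step; plain SGD by setting `momentum=0`.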
Open Source Code | Yes | We release our raw experimental data for any further analysis by the research community.3 ... 3 https://github.com/google-research/google-research/tree/master/batch_science ... 4 https://colab.research.google.com/github/google-research/google-research/blob/master/batch_science/reproduce_paper_plots.ipynb
Open Datasets | Yes | Our database contains 454 combinations of workload (model, data set, training algorithm) and batch size, each of which is associated with a metaparameter search space and a set of models trained with different configurations sampled from the search space. In total, our data contains 71,638,836 loss measurements taken over the course of training for 168,160 individual models. Together, these measurements make up the training curves of all of the individual models we trained, and can be used to reproduce all plots in this paper.4 ... We used seven image and text data sets with training set sizes ranging from 45,000 to 26 billion examples. Table 1 summarizes these data sets and Appendix A provides the full details. ... MNIST (LeCun et al., 1998) ... Fashion MNIST (Xiao et al., 2017) ... CIFAR-10 (Krizhevsky, 2009) ... ImageNet (Russakovsky et al., 2015) ... Open Images v4 (Krasin et al., 2017) ... LM1B (Chelba et al., 2014) ... Common Crawl ... All available at their respective links provided in footnotes or standard citations.
Dataset Splits | Yes | MNIST (LeCun et al., 1998) is a classic handwritten digit image classification data set with 10 mutually exclusive classes. We split the original training set into 55,000 training images and 5,000 validation images, and used the official test set of 10,000 images. ... Fashion MNIST (Xiao et al., 2017) ... We split the original training set into 55,000 training images and 5,000 validation images, and used the official test set of 10,000 images. ... CIFAR-10 (Krizhevsky, 2009) ... We split the original training set into 45,000 training images and 5,000 validation images. We used the official test set of 10,000 images. ... ImageNet (Russakovsky et al., 2015) ... We split the official training set into 1,281,167 training images and 50,045 test images, and used the official validation set of 50,000 images. ... LM1B (Chelba et al., 2014) ... We used the official training set and created validation and test sets using files news.en.heldout-00000-of-00050 and news.en.heldout-00001-of-00050, respectively. These splits contain 30,301,028; 6,075; and 6,206 sentences, respectively. ... Common Crawl ... We randomly partitioned the sentences into a training set (99.98%) and a holdout set (0.02%). Our training set contains 25.8 billion sentences. We used the first 6,075 sentences of the holdout set as our validation set.
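The quoted text gives the split sizes (e.g. 55,000/5,000 for MNIST) but not the selection procedure, so anyone re-deriving the splits must pick one. A hedged sketch, assuming a seeded random permutation (the function name and the choice of `RandomState` are assumptions, not the authors' method):

```python
import numpy as np

def split_train_validation(images, labels, num_validation=5000, seed=0):
    """Split a training set into train/validation subsets of fixed size.

    The paper states the split sizes but not how examples were chosen,
    so a seeded random permutation is assumed here for reproducibility.
    """
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(images))
    val_idx, train_idx = order[:num_validation], order[num_validation:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])

# Dummy arrays with the original MNIST training-set size of 60,000.
images = np.zeros((60000, 28, 28), dtype=np.uint8)
labels = np.zeros(60000, dtype=np.int64)
(train_x, train_y), (val_x, val_y) = split_train_validation(images, labels)
print(len(train_x), len(val_x))  # 55000 5000
```

Fixing the seed makes the split deterministic across runs, which matters when the validation set is reused for metaparameter tuning.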
Hardware Specification | No | The paper mentions 'modern GPUs and custom accelerators' and 'TPU pods' as examples of systems where training time depends only on the number of training steps, but it does not specify the particular hardware used for their own experiments.
Software Dependencies | No | The paper references 'TensorFlow' multiple times, including its 'MomentumOptimizer class' and 'tf.image.per_image_standardization' operation, but it does not provide specific version numbers for TensorFlow or any other software libraries used. Footnote 19 mentions a TensorFlow T2T utility for processing data, but again, no specific version.
Experiment Setup | Yes | In all experiments, we independently tuned the metaparameters at each batch size, including the initial learning rate η0 and, when learning rate decay was used, the decay schedule (α, T). Also, unless otherwise specified, we used the Nesterov momentum optimizer (Sutskever et al., 2013) and tuned the momentum γ. We used quasi-random search (Bousquet et al., 2017) to tune the metaparameters with equal budgets of non-divergent trials for different batch sizes. ... We used decay for ResNet-8, ResNet-50, and VGG-11, which significantly reduced training time for those models. We selected our decay function by running an extensive set of experiments with ResNet-50 on ImageNet (see Appendix C for details). We chose linear decay because it performed at least as well as all other schedules we tried, while also being the simplest and requiring only two additional metaparameters. ... We used label smoothing (Szegedy et al., 2016) to regularize training in our experiments with ResNet-50 on ImageNet. ... We replaced batch normalization (Ioffe and Szegedy, 2015) with ghost batch normalization to keep the training objective fixed between batch sizes... We used a ghost batch size of 32 for all experiments. Appendix B provides full architectural details for each model, including number of layers, filter sizes, and activation functions.
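The quoted setup describes a linear learning-rate decay with two extra metaparameters (α, T): decay from η0 toward α·η0 over T steps, then hold constant. The exact parameterization below is an assumption inferred from that description, not code from the paper:

```python
def linear_decay_lr(step, eta0, alpha, T):
    """Linearly decay the learning rate from eta0 to alpha * eta0 over the
    first T steps, then hold it constant. (alpha, T) are the two additional
    metaparameters described in the paper; this exact formula is assumed."""
    frac = min(step, T) / T
    return eta0 * (1.0 - (1.0 - alpha) * frac)

# Example: eta0 = 0.1 decayed to 1% of its initial value over 10,000 steps.
print(linear_decay_lr(0, 0.1, 0.01, 10000))      # 0.1
print(linear_decay_lr(10000, 0.1, 0.01, 10000))  # ~0.001
print(linear_decay_lr(20000, 0.1, 0.01, 10000))  # ~0.001 (held constant)
```

Because both α and T are tuned independently at each batch size, the schedule's endpoint and duration can differ across the workloads being compared.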