Measuring the Effects of Data Parallelism on Neural Network Training

Authors: Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.
Researcher Affiliation | Industry | Google Brain, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
Pseudocode | No | The paper describes algorithms such as SGD, SGD with momentum, and Nesterov momentum using mathematical equations and update rules in Section 2.2 (Algorithms), but it does not present any structured pseudocode blocks or algorithms labeled as such.
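Since the paper presents these optimizers only as update rules, a minimal NumPy sketch of the Nesterov momentum update may help make the row concrete. This follows the common Sutskever et al. (2013) formulation; the function name and the toy quadratic loss are illustrative assumptions, not the authors' code, and the paper's exact notation may differ.

```python
import numpy as np

def nesterov_momentum_step(theta, v, grad, lr=0.1, momentum=0.9):
    """One Nesterov momentum update (common Sutskever et al. formulation):
    accumulate the velocity, then take a lookahead step with it."""
    v_new = momentum * v + grad                          # velocity update
    theta_new = theta - lr * (momentum * v_new + grad)   # lookahead parameter step
    return theta_new, v_new

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = nesterov_momentum_step(theta, v, grad=theta)
# theta has contracted toward the minimum at the origin
```

Plain momentum is recovered by dropping the extra `momentum * v_new` lookahead term in the parameter step; plain SGD by setting `momentum=0`.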
Open Source Code | Yes | We release our raw experimental data for any further analysis by the research community.3 ... 3 https://github.com/google-research/google-research/tree/master/batch_science ... 4 https://colab.research.google.com/github/google-research/google-research/blob/master/batch_science/reproduce_paper_plots.ipynb
Open Datasets | Yes | Our database contains 454 combinations of workload (model, data set, training algorithm) and batch size, each of which is associated with a metaparameter search space and a set of models trained with different configurations sampled from the search space. In total, our data contains 71,638,836 loss measurements taken over the course of training for 168,160 individual models. Together, these measurements make up the training curves of all of the individual models we trained, and can be used to reproduce all plots in this paper.4 ... We used seven image and text data sets with training set sizes ranging from 45,000 to 26 billion examples. Table 1 summarizes these data sets and Appendix A provides the full details. ... MNIST (LeCun et al., 1998) ... Fashion MNIST (Xiao et al., 2017) ... CIFAR-10 (Krizhevsky, 2009) ... ImageNet (Russakovsky et al., 2015) ... Open Images v4 (Krasin et al., 2017) ... LM1B (Chelba et al., 2014) ... Common Crawl ... All available at their respective links provided in footnotes or standard citations.
Dataset Splits | Yes | MNIST (LeCun et al., 1998) is a classic handwritten digit image classification data set with 10 mutually exclusive classes. We split the original training set into 55,000 training images and 5,000 validation images, and used the official test set of 10,000 images. ... Fashion MNIST (Xiao et al., 2017) ... We split the original training set into 55,000 training images and 5,000 validation images, and used the official test set of 10,000 images. ... CIFAR-10 (Krizhevsky, 2009) ... We split the original training set into 45,000 training images and 5,000 validation images. We used the official test set of 10,000 images. ... ImageNet (Russakovsky et al., 2015) ... We split the official training set into 1,281,167 training images and 50,045 test images, and used the official validation set of 50,000 images. ... LM1B (Chelba et al., 2014) ... We used the official training set and created validation and test sets using files news.en.heldout-00000-of-00050 and news.en.heldout-00001-of-00050, respectively. These splits contain 30,301,028; 6,075; and 6,206 sentences, respectively. ... Common Crawl ... We randomly partitioned the sentences into a training set (99.98%) and a holdout set (0.02%). Our training set contains 25.8 billion sentences. We used the first 6,075 sentences of the holdout set as our validation set.
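The quoted text gives the split sizes (e.g. 55,000/5,000 for MNIST) but not the selection procedure, so anyone re-deriving the splits must pick one. A hedged sketch, assuming a seeded random permutation (the function name and the choice of `RandomState` are assumptions, not the authors' method):

```python
import numpy as np

def split_train_validation(images, labels, num_validation=5000, seed=0):
    """Split a training set into train/validation subsets of fixed size.

    The paper states the split sizes but not how examples were chosen,
    so a seeded random permutation is assumed here for reproducibility.
    """
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(images))
    val_idx, train_idx = order[:num_validation], order[num_validation:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])

# Dummy arrays with the original MNIST training-set size of 60,000.
images = np.zeros((60000, 28, 28), dtype=np.uint8)
labels = np.zeros(60000, dtype=np.int64)
(train_x, train_y), (val_x, val_y) = split_train_validation(images, labels)
print(len(train_x), len(val_x))  # 55000 5000
```

Fixing the seed makes the split deterministic across runs, which matters when the validation set is reused for metaparameter tuning.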
Hardware Specification | No | The paper mentions 'modern GPUs and custom accelerators' and 'TPU pods' as examples of systems where training time depends only on the number of training steps, but it does not specify the particular hardware used for their own experiments.
Software Dependencies | No | The paper references 'TensorFlow' multiple times, including its 'MomentumOptimizer class' and 'tf.image.per_image_standardization' operation, but it does not provide specific version numbers for TensorFlow or any other software libraries used. Footnote 19 mentions a TensorFlow T2T utility for processing data, but again, no specific version.
Experiment Setup | Yes | In all experiments, we independently tuned the metaparameters at each batch size, including the initial learning rate η0 and, when learning rate decay was used, the decay schedule (α, T). Also, unless otherwise specified, we used the Nesterov momentum optimizer (Sutskever et al., 2013) and tuned the momentum γ. We used quasi-random search (Bousquet et al., 2017) to tune the metaparameters with equal budgets of non-divergent trials for different batch sizes. ... We used decay for ResNet-8, ResNet-50, and VGG-11, which significantly reduced training time for those models. We selected our decay function by running an extensive set of experiments with ResNet-50 on ImageNet (see Appendix C for details). We chose linear decay because it performed at least as well as all other schedules we tried, while also being the simplest and requiring only two additional metaparameters. ... We used label smoothing (Szegedy et al., 2016) to regularize training in our experiments with ResNet-50 on ImageNet. ... We replaced batch normalization (Ioffe and Szegedy, 2015) with ghost batch normalization to keep the training objective fixed between batch sizes... We used a ghost batch size of 32 for all experiments. Appendix B provides full architectural details for each model, including number of layers, filter sizes, and activation functions.
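The quoted setup describes a linear learning-rate decay with two extra metaparameters (α, T): decay from η0 toward α·η0 over T steps, then hold constant. The exact parameterization below is an assumption inferred from that description, not code from the paper:

```python
def linear_decay_lr(step, eta0, alpha, T):
    """Linearly decay the learning rate from eta0 to alpha * eta0 over the
    first T steps, then hold it constant. (alpha, T) are the two additional
    metaparameters described in the paper; this exact formula is assumed."""
    frac = min(step, T) / T
    return eta0 * (1.0 - (1.0 - alpha) * frac)

# Example: eta0 = 0.1 decayed to 1% of its initial value over 10,000 steps.
print(linear_decay_lr(0, 0.1, 0.01, 10000))      # 0.1
print(linear_decay_lr(10000, 0.1, 0.01, 10000))  # ~0.001
print(linear_decay_lr(20000, 0.1, 0.01, 10000))  # ~0.001 (held constant)
```

Because both α and T are tuned independently at each batch size, the schedule's endpoint and duration can differ across the workloads being compared.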