Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities

Authors: Brian R. Bartoldson, Bhavya Kailkhura, Davis Blalock

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Davis conducted all experiments and led the creation of a guide to achieving speedups in practice. From the paper: "To address these fragmentation issues, we eschew a more traditional survey approach that focuses on just a single component (e.g., the model) or single action (e.g., reducing model size) in the training pipeline. Instead, we adopt a wholistic view of the speedup problem and emphasize that one needs to carefully select a combination of techniques, which we survey in Section 3, to overcome various compute-platform bottlenecks. We use experiments to illustrate the importance of such a wholistic view to achieving speedup in practice, and we provide guidance informed by the relationships between different bottlenecks and components of training." Appendix B. Experimental Details: "All models were trained on a single machine with eight A100s and two 32-core AMD EPYC 7513 processors."
Researcher Affiliation | Collaboration | Brian R. Bartoldson, EMAIL, Lawrence Livermore National Laboratory, USA; Bhavya Kailkhura, EMAIL, Lawrence Livermore National Laboratory, USA; Davis Blalock, Davis@mosaicml.com, MosaicML, USA
Pseudocode | No | The paper describes various algorithms and methods but does not provide structured pseudocode blocks or algorithms labeled as such. It mainly surveys existing techniques and discusses their mechanisms.
Open Source Code | No | The paper does not provide a specific link to a code repository for its own methodology or an explicit statement of code release. It mentions using existing libraries like Composer (Team, 2021) for experiments, but this refers to third-party tools rather than the authors' own implementation for the paper's novel contributions: "We chose these methods because all had tested implementations in a common library." The license information pertains to the paper itself, not source code.
Open Datasets | Yes | Experiments on CIFAR-10, CIFAR-100, and SVHN show that Selective-Backprop can achieve a 3.5x speedup compared to standard SGD in exchange for a decrease in accuracy. On CIFAR-10, CIFAR-100, and FOOD-101(N), they found that scoring and ordering had no effect on model quality at convergence. Sorscher et al. (2022) evaluate many existing data pruning metrics, finding that they perform poorly on ImageNet even when they performed well on smaller datasets like CIFAR-10. Dubois et al. (2021) provide a minimal script that trains an image encoder, encodes the STL dataset, and trains a linear classifier on the resulting encodings to 98.7% accuracy in under five minutes. Gonzalez and Miikkulainen (2020) apply genetic programming to learn loss functions from primitive operations, using MNIST validation dataset performance as a signal.
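The Selective-Backprop result cited above rests on a simple idea: spend backward passes only on the examples the model currently finds hard. The sketch below illustrates that idea with a deterministic top-k filter over per-example losses; the actual method of Jiang et al. samples probabilistically from the loss distribution, and `keep_frac` is an illustrative parameter, not a value from the paper.

```python
import numpy as np

def select_for_backprop(losses, keep_frac=0.5):
    """Simplified Selective-Backprop-style filter: keep only the
    highest-loss fraction of a batch for the backward pass.
    (The published method samples by loss percentile; this
    deterministic top-k variant is a hedged approximation.)"""
    losses = np.asarray(losses)
    k = max(1, int(len(losses) * keep_frac))
    # Indices of the k largest per-example losses (hardest examples).
    return np.argsort(losses)[-k:]

# Example: a batch of 8 per-example losses.
losses = [0.1, 2.3, 0.05, 1.7, 0.4, 3.1, 0.2, 0.9]
idx = select_for_backprop(losses, keep_frac=0.25)
# The backward pass would then run only on batch[idx].
```

Skipping the backward pass for easy examples is where the speedup comes from, since backprop typically costs about twice as much as the forward pass.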
Dataset Splits | No | The paper mentions using well-known datasets like ImageNet, CIFAR-10, and CIFAR-100, which typically have standard splits, but it does not explicitly state the train/validation/test splits used for its own experiments.
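When a paper relies on standard splits without stating them, reproductions often need to fix their own held-out split. A minimal sketch of a deterministic, seeded split by shuffled index; the fraction and seed here are illustrative choices, not values from the paper:

```python
import numpy as np

def make_splits(n, val_frac=0.1, seed=0):
    """Deterministic train/validation split over n examples.
    A fixed seed makes the split reproducible across runs."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]  # train indices, val indices

# Example: CIFAR-10-sized training set (50,000 examples).
train_idx, val_idx = make_splits(50_000, val_frac=0.1, seed=0)
```

Recording the seed and fraction alongside results is usually enough to make such a split reproducible.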
Hardware Specification | Yes | Appendix B. Experimental Details: "All models were trained on a single machine with eight A100s and two 32-core AMD EPYC 7513 processors. Microbenchmarking experiments used a single A100, with means and standard deviations computed from five trials. All results use half-precision weights and activations."
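The five-trial mean-and-standard-deviation protocol quoted above is easy to replicate. A minimal pure-Python sketch of that timing loop; real GPU microbenchmarks additionally need warmup iterations and device synchronization (e.g. `torch.cuda.synchronize()`), which are omitted here:

```python
import statistics
import time

def microbenchmark(fn, trials=5):
    """Time fn over several trials; report mean and standard
    deviation of wall-clock seconds, mirroring the five-trial
    protocol described in the paper's Appendix B."""
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

# Example: benchmark a trivial CPU-bound workload.
mean_s, std_s = microbenchmark(lambda: sum(range(100_000)), trials=5)
```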
Software Dependencies | No | The paper mentions using PyTorch for microbenchmarking: "we profile individual PyTorch (Paszke et al., 2017) operations on a 40GB A100". It also refers to Composer (Team, 2021) as a common library where some methods are implemented. However, specific version numbers for PyTorch, Composer, or other critical software dependencies are not provided within the text.
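Missing version numbers are a recoverable gap if the environment is logged at run time. A sketch of recording installed package versions with the standard library; the package names listed are examples, not a dependency list from the paper:

```python
from importlib import metadata

def dependency_versions(packages):
    """Record installed versions for the named packages.
    Packages that are not installed are reported as None
    rather than raising, so the log is always complete."""
    out = {}
    for name in packages:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = None
    return out

# Example: log the versions relevant to this paper's experiments.
versions = dependency_versions(["torch", "numpy"])
```

Emitting such a dictionary alongside results would have made the "Software Dependencies" criterion trivially satisfiable.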
Experiment Setup | No | Appendix B. Experimental Details: "We chose the hyperparameters for speedup methods in Figure 13 based on the hyperparameters used for these recipes. When the hyperparameters did not vary across recipes for a given speedup, we made up similar hyperparameters on a best-effort basis that would allow for assessing alternate speed vs accuracy tradeoffs, e.g., choosing different degrees of progressive resizing. These hyperparameters may not be optimal, so it is important to conclude only that certain baselines can outperform these methods, not that they always will."
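Progressive resizing, named above as one of the varied speedup hyperparameters, trains on small images early and ramps up to full resolution. A sketch of one plausible linear schedule; every value here (start/end size, ramp fraction, rounding multiple) is a hypothetical hyperparameter for illustration, not one of the paper's recipe settings:

```python
def progressive_resize_schedule(epoch, total_epochs,
                                start_size=160, end_size=224,
                                grow_frac=0.75, multiple=32):
    """Illustrative progressive-resizing schedule: ramp the image
    size linearly from start_size to end_size over the first
    grow_frac of training, then hold at full resolution.
    All defaults are hypothetical, not from the paper."""
    ramp_epochs = max(1, int(total_epochs * grow_frac))
    progress = min(1.0, epoch / ramp_epochs)
    size = start_size + progress * (end_size - start_size)
    # Round down to a hardware-friendly multiple.
    return int(size // multiple * multiple)

# Example: image sizes at the start, middle, and end of 100 epochs.
sizes = [progressive_resize_schedule(e, 100) for e in (0, 40, 80)]
```

Varying `grow_frac` or `start_size` changes how aggressively the schedule trades accuracy for speed, which matches the "different degrees of progressive resizing" the appendix describes.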