Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

Authors: Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We examine the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks. We begin by showing that using standard dense training recipes for sparse training is suboptimal, and provide evidence that this results in undertraining, loosely defined as using a suboptimal number of passes over the training data. We present training recipes for mitigating this issue for both sparse pre-training of vision models (e.g. ResNet50/ImageNet) and sparse fine-tuning of language models (e.g. BERT/GLUE), achieving state-of-the-art results in both settings in the high-sparsity regime, and providing detailed analyses of the difficulty of sparse training in both scenarios.
Researcher Affiliation | Collaboration | Denis Kuznedelev (Skoltech & Yandex); Eldar Kurtic (IST Austria); Eugenia Iofinova (IST Austria); Elias Frantar (IST Austria); Alexandra Peste (IST Austria); Dan Alistarh (IST Austria & Neural Magic)
Pseudocode | No | The paper describes methods such as AC/DC and RigL in prose and provides equations, but does not contain any structured pseudocode or algorithm blocks.
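Since the paper gives no pseudocode, the following is a minimal sketch of the alternating schedule behind AC/DC (Peste et al., 2021) as the review describes it: equally sized compression (sparse) and decompression (dense) phases, with magnitude pruning at each compression step. The function names, the dense-first phase ordering, and the NumPy formulation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def magnitude_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Boolean mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(round(weights.size * sparsity))  # number of weights to prune
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold

def acdc_phase(epoch: int, phase_len: int = 5) -> str:
    """Alternate equally sized dense (decompression) and sparse (compression)
    phases; the 5-epoch duration matches the setting the review quotes as optimal."""
    return "dense" if (epoch // phase_len) % 2 == 0 else "sparse"

w = np.array([0.1, -0.9, 0.05, 0.7, -0.3, 0.2])
mask = magnitude_mask(w, sparsity=0.5)          # keeps the 3 largest-magnitude entries
schedule = [acdc_phase(e) for e in range(15)]   # 5 dense, 5 sparse, 5 dense
```

In the actual method the mask would be re-applied to gradients and weights throughout each sparse phase and discarded during dense phases; the sketch only shows the two scheduling primitives.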
Open Source Code | No | The paper mentions using and comparing against existing methods like RigL (Evci et al., 2020) and AC/DC (Peste et al., 2021), and refers to external resources such as the FFCV library (Leclerc et al., 2022) and www.github.com/google-research/rigl. However, there is no explicit statement from the authors about releasing the code for their own proposed methodology (AC/DC++), and no link to a repository of their own.
Open Datasets | Yes | image classification using the ResNet50 model (He et al., 2016) on the ImageNet-1K dataset (Russakovsky et al., 2015) ... and language modelling using the BERT-base model (Devlin et al., 2019) on the GLUE benchmark datasets (Wang et al., 2018). ... In Appendix Section C, we consider the alternative Validation entropy metric, and present a similar validation on the CelebA dataset. ... We test robustness by measuring model performance on the ImageNet-C dataset (Hendrycks & Dietterich, 2019).
Dataset Splits | Yes | We examine validation accuracy on trained sparse and dense ResNet50 models on the ImageNet-1K dataset and compare it to the train loss on the last epoch of training. ... For fair comparisons with results from prior work, we employ early stopping for all methods. ... On the dev-set of the corresponding GLUE task ... We provide more details about each dataset in Appendix O. ... We test robustness by measuring model performance on the ImageNet-C dataset (Hendrycks & Dietterich, 2019), which digitally adds 19 types of perturbations to the ImageNet-1K validation set.
Hardware Specification | Yes | We note that each of the experiments presented in the paper takes less than a day on a standard 8-GPU NVIDIA RTX 3090 server.
Software Dependencies | No | The paper mentions using the 'PyTorch FFCV package (Leclerc et al., 2022)' and the 'Adam optimizer (Kingma & Ba, 2015)', but it does not specify version numbers for PyTorch, FFCV, or any other software libraries or frameworks used in the implementation.
Experiment Setup | Yes | All models were trained using standard hyperparameters (see Appendix A), except for the difference in the number of training epochs across experiments. ... In addition to an extended training schedule, we use label smoothing and a linear learning rate decay with warm-up, as well as progressive resizing of input samples. ... AC/DC phase duration: In Appendix I we confirm that for ResNet50 models trained on ImageNet, and assuming equally-sized compression and decompression phases, the 5-epoch phase duration used in the initial paper is optimal. ... We search and tune the initial learning rate in {1e-4, 2e-4, 3e-4} and dropout in {0.05, 0.1}, and report mean performance over the two best runs.
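The quoted tuning protocol (learning rate in {1e-4, 2e-4, 3e-4}, dropout in {0.05, 0.1}, reporting the mean of the two best runs) can be sketched as below. Here `train_and_eval` is a hypothetical placeholder for the actual BERT/GLUE fine-tuning run, returning a dev-set score; only the grid and the two-best averaging reflect the paper's description:

```python
import itertools
from statistics import mean

LEARNING_RATES = [1e-4, 2e-4, 3e-4]
DROPOUTS = [0.05, 0.1]

def train_and_eval(lr: float, dropout: float) -> float:
    """Placeholder for a full fine-tuning run; returns a dummy
    deterministic dev-set score so the sketch is runnable."""
    return 80.0 + 1e4 * lr - 10 * dropout

# Evaluate every (learning rate, dropout) combination in the grid.
scores = [train_and_eval(lr, d)
          for lr, d in itertools.product(LEARNING_RATES, DROPOUTS)]

# Report the mean over the two best runs, as in the quoted setup.
reported = mean(sorted(scores, reverse=True)[:2])
```

In practice each grid point would launch a full training run (often with several seeds); the averaging over the top two runs is a simple way to reduce the variance of the reported number.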