Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Authors: Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate different types of NAS for structural pruning on eight text classification tasks, including textual entailment, sentiment analysis and multiple-choice question answering. We provide a detailed description of each task in Appendix C. All tasks come with a predefined training and evaluation set with labels and a hold-out test set without labels. We split the training set into a training and validation set (70%/30% split) and use the evaluation set as test set. We fine-tune every network, sub-network or super-network, for 5 epochs on a single GPU. For all multi-objective search methods, we use Syne Tune (Salinas et al., 2022) on a single GPU instance. We use BERT-base (Devlin et al., 2019) (cased) and RoBERTa-base (Liu et al., 2019b) as pre-trained networks, which consist of L = 12 layers, I = 3072 units and H = 12 heads (other hyperparameters are described in Appendix A). While arguably rather small by today's standards, they still achieve competitive performance on these benchmarks and allow for a more thorough evaluation.
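The quote above describes a multi-objective NAS setup that trades off test error against model size. As an illustrative aside, the Pareto front over two minimized objectives can be extracted with a simple non-dominated filter; this is a generic sketch, not the Syne Tune implementation used in the paper:

```python
def pareto_front(points):
    """Return the non-dominated subset of (error, size) pairs,
    treating both objectives as minimized."""
    front = []
    for i, p in enumerate(points):
        # p is dominated if some other point is at least as good
        # on both objectives (and differs from p).
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front

# Example: (0.30, 200) is dominated by (0.20, 50) on both objectives.
candidates = [(0.10, 100), (0.20, 50), (0.30, 200)]
```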
Researcher Affiliation Industry Aaron Klein EMAIL Amazon Web Services Jacek Golebiowski EMAIL Amazon Web Services Xingchen Ma EMAIL Amazon Web Services Valerio Perrone EMAIL Amazon Web Services Cedric Archambeau EMAIL Helsing
Pseudocode Yes Algorithms 1, 2, 3 and 4 show pseudocode for the LAYER, SMALL, MEDIUM and LARGE search spaces, respectively.
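The paper's Algorithms 1-4 define the actual search spaces and are not reproduced in this card. Purely to illustrate the idea of a per-layer structured search space over a 12-layer, 12-head, 3072-unit backbone (the dimensions quoted above), here is a hypothetical sampler; the function name and granularity are assumptions, not the paper's definitions:

```python
import random

def sample_subnetwork(rng, num_layers=12, num_heads=12, ffn_units=3072):
    """Hypothetical sketch: draw a sub-network by choosing how many
    layers to keep, then heads and FFN units per kept layer. The real
    LAYER/SMALL/MEDIUM/LARGE spaces are given in the paper's
    Algorithms 1-4; this only illustrates the concept."""
    kept = rng.randint(1, num_layers)
    return {
        "num_layers": kept,
        "heads": [rng.randint(1, num_heads) for _ in range(kept)],
        "units": [rng.randint(1, ffn_units) for _ in range(kept)],
    }
```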
Open Source Code Yes Code is available at https://github.com/whittleorg/plm_pruning.
Open Datasets Yes We use the following 10 text classification datasets. All datasets are classification tasks, except for STSB, which is a regression dataset. The Recognizing Textual Entailment (RTE) dataset... The Microsoft Research Paraphrase Corpus (MRPC) dataset... The Semantic Textual Similarity Benchmark (STSB)... The Corpus of Linguistic Acceptability (COLA) dataset... The IMDB dataset... The Stanford Sentiment Treebank (SST2)... Situations With Adversarial Generations (SWAG) dataset... QNLI is a modified version of the Stanford Question Answering Dataset...
Dataset Splits Yes We split the training set into a training and validation set (70%/30% split) and use the evaluation set as test set.
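The quoted 70%/30% split can be reproduced in a few lines. A minimal sketch, assuming a simple seeded shuffle (the paper does not specify its exact shuffling or seed here):

```python
import random

def split_train_val(examples, val_fraction=0.3, seed=0):
    """Shuffle and split examples into train/validation sets
    (70%/30% by default, matching the quoted setup)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_val = int(len(examples) * val_fraction)
    val = [examples[i] for i in idx[:n_val]]
    train = [examples[i] for i in idx[n_val:]]
    return train, val
```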
Hardware Specification Yes We fine-tune every network, sub-network or super-network, for 5 epochs on a single GPU. For all multi-objective search methods, we use Syne Tune (Salinas et al., 2022) on a single GPU instance. ...Test error versus memory footprint (left) and latency (right) on 3 different GPU types (T4, A10, V100) for the Pareto front found by our NAS strategy and the un-pruned network with 8-bit and 4-bit quantization.
Software Dependencies No Table A shows the hyperparameters for fine-tuning the super-network. We largely follow default hyperparameters recommended by the Hugging Face transformers library. For all multi-objective search methods, we follow the default hyperparameters of Syne Tune.
Experiment Setup Yes We fine-tune every network, sub-network or super-network, for 5 epochs on a single GPU. For all multi-objective search methods, we use Syne Tune (Salinas et al., 2022) on a single GPU instance. ...Table A shows the hyperparameters for fine-tuning the super-network. We largely follow default hyperparameters recommended by the Hugging Face transformers library. For all multi-objective search methods, we follow the default hyperparameters of Syne Tune. Hyperparameters: Learning Rate: 0.00002; Number of random sub-networks k: 2; Temperature T: 10; Batch Size: 4.
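The reported hyperparameters can be collected into a single config mapping for clarity; the key names below are illustrative (loosely following Hugging Face `TrainingArguments` naming), not identifiers from the paper's code:

```python
# Hedged sketch: values come from the reported hyperparameter table
# and the quoted text; key names are assumptions for illustration.
FINETUNE_CONFIG = {
    "learning_rate": 2e-5,            # reported as 0.00002
    "num_random_subnetworks": 2,      # k: random sub-networks per step
    "temperature": 10,                # T
    "per_device_train_batch_size": 4, # batch size
    "num_train_epochs": 5,            # "5 epochs on a single GPU"
}
```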