Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Structural Pruning of Pre-trained Language Models via Neural Architecture Search
Authors: Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate different types of NAS for structural pruning on eight text classification tasks, including textual entailment, sentiment analysis and multiple-choice question answering. We provide a detailed description of each task in Appendix C. All tasks come with a predefined training and evaluation set with labels and a hold-out test set without labels. We split the training set into a training and validation set (70%/30% split) and use the evaluation set as test set. We fine-tune every network, sub-network or super-network, for 5 epochs on a single GPU. For all multi-objective search methods, we use Syne Tune (Salinas et al., 2022) on a single GPU instance. We use BERT-base (Devlin et al., 2019) (cased) and RoBERTa-base (Liu et al., 2019b) as pre-trained networks, which consist of L = 12 layers, I = 3072 units and H = 12 heads (other hyperparameters are described in Appendix A). While arguably rather small for today's standards, they still achieve competitive performance on these benchmarks and allow for a more thorough evaluation. |
| Researcher Affiliation | Industry | Aaron Klein EMAIL Amazon Web Services Jacek Golebiowski EMAIL Amazon Web Services Xingchen Ma EMAIL Amazon Web Services Valerio Perrone EMAIL Amazon Web Services Cedric Archambeau EMAIL Helsing |
| Pseudocode | Yes | Algorithm 1, 2, 3 and 4 show pseudo code for the LAYER, SMALL, MEDIUM and LARGE search space, respectively. |
| Open Source Code | Yes | Code is available at https://github.com/whittleorg/plm_pruning. |
| Open Datasets | Yes | We use the following text classification datasets. All datasets are classification tasks, except for STSB, which is a regression dataset. The Recognizing Textual Entailment (RTE) dataset... The Microsoft Research Paraphrase Corpus (MRPC) dataset... The Semantic Textual Similarity Benchmark (STSB)... The Corpus of Linguistic Acceptability (COLA) dataset... The IMDB dataset... The Stanford Sentiment Treebank (SST2)... Situations With Adversarial Generations (SWAG) dataset... QNLI is a modified version of the Stanford Question Answering Dataset... |
| Dataset Splits | Yes | We split the training set into a training and validation set (70%/30% split) and use the evaluation set as test set. |
| Hardware Specification | Yes | We fine-tune every network, sub-network or super-network, for 5 epochs on a single GPU. For all multi-objective search methods, we use Syne Tune (Salinas et al., 2022) on a single GPU instance. ...Test error versus memory footprint (left) and latency (right) on 3 different GPU types (T4, A10, V100) for the Pareto front found by our NAS strategy and the un-pruned network with 8bit and 4bit quantization. |
| Software Dependencies | No | Table A shows the hyperparameters for fine-tuning the super-network. We largely follow default hyperparameters recommended by the Hugging Face transformers library. For all multi-objective search methods, we follow the default hyperparameters of Syne Tune. |
| Experiment Setup | Yes | We fine-tune every network, sub-network or super-network, for 5 epochs on a single GPU. For all multi-objective search methods, we use Syne Tune (Salinas et al., 2022) on a single GPU instance. ...Table A shows the hyperparameters for fine-tuning the super-network. We largely follow default hyperparameters recommended by the Hugging Face transformers library. For all multi-objective search methods, we follow the default hyperparameters of Syne Tune. Table A hyperparameters: Learning Rate = 0.00002; Number of random sub-networks k = 2; Temperature T = 10; Batch Size = 4. |
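The 70%/30% train/validation split and the Table A hyperparameters quoted above can be sketched as follows. This is a minimal illustration, not code from the paper's released repository; the helper `split_train_validation` and the config dict name are hypothetical, while the numeric values come from the excerpts.

```python
import random

# Fine-tuning hyperparameters as reported in Table A of the paper
# (dict name and key names are illustrative, not from the paper's code).
FINE_TUNE_CONFIG = {
    "learning_rate": 2e-5,       # Learning Rate 0.00002
    "num_random_subnetworks_k": 2,
    "temperature_T": 10,
    "batch_size": 4,
    "epochs": 5,                 # "fine-tune ... for 5 epochs"
}

def split_train_validation(examples, train_fraction=0.7, seed=0):
    """Shuffle and split the labeled training set 70%/30%,
    mirroring the split described in the report (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    cut = int(len(examples) * train_fraction)
    train = [examples[i] for i in indices[:cut]]
    valid = [examples[i] for i in indices[cut:]]
    return train, valid

# Example with 100 dummy examples -> 70 train / 30 validation
train, valid = split_train_validation(list(range(100)))
print(len(train), len(valid))  # 70 30
```

The paper's evaluation set then serves as the test set, since the official test sets ship without labels.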