Text Quality-Based Pruning for Efficient Training of Language Models
Authors: Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer
DMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% less data and training 42% faster on the Open Web Text dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset. |
| Researcher Affiliation | Industry | Vasu Sharma* EMAIL FAIR, Meta; Karthik Padthe* EMAIL FAIR, Meta; Newsha Ardalani EMAIL FAIR, Meta; Kushal Tirumala EMAIL FAIR, Meta; Russell Howes EMAIL FAIR, Meta; Hu Xu EMAIL FAIR, Meta; Po-Yao Huang EMAIL FAIR, Meta; Shang-Wen Li EMAIL FAIR, Meta; Armen Aghajanyan EMAIL FAIR, Meta; Gargi Ghosh EMAIL FAIR, Meta; Luke Zettlemoyer EMAIL FAIR, Meta |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (Equations 1, 2, 3) in section 2, 'Methodology'. There are no explicitly labeled pseudocode blocks or algorithm figures present in the document. |
| Open Source Code | No | The paper mentions using 'spacy Honnibal et al. (2020)' and 'Hugging Face based pre-trained language model Wolf et al. (2020)' for implementation, but it does not provide any statement or link for the open-sourcing of the authors' own methodology or code. |
| Open Datasets | Yes | We experiment with English-only versions of the following datasets for our study: Wikipedia Tunstall et al.: This dataset is built from the Wikipedia dump... Open Webtext Gokaslan et al. (2019): This dataset is the open-source version of the WebText dataset used for GPT-2 training. |
| Dataset Splits | Yes | We calculate validation perplexity for each dataset, where the validation set is 20% of the whole dataset, sampled before pruning and removed from the training data used for pruning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for the experiments. |
| Software Dependencies | No | The paper mentions using 'spacy Honnibal et al. (2020)', 'Hugging Face based pre-trained language model Wolf et al. (2020)', and 'Hugging Face trainer' but does not specify the version numbers for these software components. |
| Experiment Setup | Yes | All models are trained from scratch for 15 epochs with a batch size of 128; we use the Hugging Face trainer to train our models. |
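The split protocol quoted in the Dataset Splits row (20% of the whole dataset sampled for validation before pruning, then removed from the training pool) can be sketched as follows. This is a minimal illustration only; the function name `make_split`, the fixed seed, and the toy document list are assumptions, not details from the paper.

```python
import random

def make_split(documents, val_fraction=0.2, seed=0):
    """Hold out a validation set BEFORE pruning, per the paper's description:
    20% of the whole dataset is sampled for validation (used to compute
    validation perplexity) and removed from the pool that pruning sees."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    n_val = int(len(docs) * val_fraction)
    val_set = docs[:n_val]
    train_pool = docs[n_val:]  # quality-based pruning applies only to this pool
    return train_pool, val_set

# Toy usage: 100 documents -> 80 available for pruning/training, 20 held out.
train_pool, val_set = make_split([f"doc{i}" for i in range(100)])
```

Because the split is drawn before any pruning, the validation perplexity reported for each dataset is measured on text the pruning criterion never filtered.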