Text Quality-Based Pruning for Efficient Training of Language Models

Authors: Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer

DMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% less data and training 42% faster on the OpenWebText dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset.
Researcher Affiliation | Industry | Vasu Sharma*, Karthik Padthe*, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer (all FAIR, Meta)
Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (Equations 1, 2, 3) in Section 2, 'Methodology'. There are no explicitly labeled pseudocode blocks or algorithm figures present in the document.
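Since the paper gives no pseudocode, a minimal sketch of perplexity-based pruning may help illustrate the general idea. This is not the authors' method: the scoring function, the keep fraction, and the assumption that lower perplexity means higher quality are all hypothetical stand-ins for the paper's Equations 1-3.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a document from its per-token log-probabilities
    (natural log), as produced by some reference language model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def prune_by_quality(doc_ids, score_fn, keep_fraction=0.6):
    """Keep the keep_fraction of documents the scorer rates best.

    Assumption: lower perplexity = higher quality. The paper's actual
    quality score (Equations 1-3) may differ from this toy ranking.
    """
    ranked = sorted(doc_ids, key=score_fn)       # best (lowest score) first
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

For example, with a 60% keep fraction this drops the 40% of documents the reference model finds least predictable, which matches the "40% less data" regime reported for OpenWebText only in spirit, not in mechanism.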
Open Source Code | No | The paper mentions using spaCy (Honnibal et al., 2020) and a Hugging Face pre-trained language model (Wolf et al., 2020) for implementation, but it does not provide any statement or link for open-sourcing the authors' own methodology or code.
Open Datasets | Yes | We experiment with English-only versions of the following datasets for our study: Wikipedia (Tunstall et al.): this dataset is built from the Wikipedia dump... OpenWebText (Gokaslan et al., 2019): this dataset is the open-source version of the WebText dataset used for GPT-2 training.
Dataset Splits | Yes | We calculate validation perplexity for each dataset, where the validation set is 20% of the whole dataset, sampled before pruning and removed from the training data used for pruning.
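The split described above (sample 20% for validation before pruning, so pruning never touches validation examples) can be sketched as follows; the function name and seed are illustrative choices, not taken from the paper.

```python
import random

def make_splits(dataset, val_fraction=0.2, seed=0):
    """Sample a validation set BEFORE pruning, as the paper describes,
    so the pruning step only ever sees the remaining training data."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_val = int(len(dataset) * val_fraction)
    val_idx = set(indices[:n_val])
    val = [dataset[i] for i in sorted(val_idx)]
    train = [dataset[i] for i in range(len(dataset)) if i not in val_idx]
    return train, val
```

Sampling the validation set first matters: if pruning ran on the full corpus, the validation perplexity would be measured on data the pruning criterion had already filtered.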
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for the experiments.
Software Dependencies | No | The paper mentions spaCy (Honnibal et al., 2020), a Hugging Face pre-trained language model (Wolf et al., 2020), and the Hugging Face Trainer, but does not specify version numbers for these software components.
Experiment Setup | Yes | All the models are trained from scratch for 15 epochs with a batch size of 128; we use the Hugging Face Trainer to train our models.
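Only two hyperparameters are stated in the paper (15 epochs, batch size 128). A minimal configuration sketch in the style of Hugging Face `TrainingArguments` keyword arguments might look like the following; every field other than the epoch count and batch size is a hypothetical placeholder, and whether 128 is a global or per-device batch size is not specified.

```python
# Sketch of a Trainer-style configuration. Stated in the paper:
# 15 epochs and batch size 128. Everything else is an assumption.
training_config = {
    "num_train_epochs": 15,               # stated: 15 epochs
    "per_device_train_batch_size": 128,   # stated: 128 (global vs per-device unclear)
    "output_dir": "./checkpoints",        # placeholder, not from the paper
    "seed": 42,                           # placeholder, not from the paper
}
```

In practice these keys could be unpacked into `transformers.TrainingArguments(**training_config)`, but no such code is given in the paper.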