Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Authors: Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max M Marion, Matthew Leavitt, Mansheej Paul

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125-million-parameter model improves the average downstream task performance of a 3-billion-parameter model by up to 2.04 and achieves up to a 1.45× reduction in pretraining steps to reach commensurate baseline performance. We evaluate models on 33 different downstream question-answering tasks using the MosaicML evaluation gauntlet (MosaicML, 2023a).
Researcher Affiliation | Collaboration | Zachary Ankner (1,2), Cody Blakeney (1), Kartik Sreenivasan (1), Max Marion (1), Matthew L. Leavitt (3), Mansheej Paul (1); 1 Databricks, 2 MIT, 3 Datology AI
Pseudocode | Yes | Algorithm 1: Pseudocode for performing perplexity-based data pruning.
Open Source Code | No | The paper mentions and cites 'llm-foundry' (MosaicML, 2023b) as a tool used for training, with a GitHub link in the references. However, it does not state that the authors release their own implementation of the methodology described in the paper, nor does it link to a repository for this specific work.
Open Datasets | Yes | We consider two datasets in this work. The Pile (Gao et al., 2020) is composed of 22 different domains that range from general web scrapes to legal text. Dolma (Soldaini et al., 2024) is composed of 7 different domains and is derived mainly from general web scrapes.
Dataset Splits | No | First, we partition the original dataset into two splits: one for training the reference model and one for training the final model. We then prune the final model's dataset split to a fraction of its original size, referred to as the selection rate (rs), by selecting samples according to a selection criterion, which can be one of low, medium, or high. Algorithm 1 specifies `Dref, Dtrain = random_split(D, R)`, where R is the 'reference training split size'; however, the specific value or proportion of R is not quantitatively stated for the main pruning methodology.
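The split-then-prune procedure described in the row above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `perplexity_fn` stands in for scoring a sample with the trained 125M reference model, and `ref_fraction` is an assumed knob, since the paper does not quantify R.

```python
import random

def prune_by_perplexity(dataset, perplexity_fn, selection_rate, criterion="high"):
    """Keep a `selection_rate` fraction of `dataset`, chosen by perplexity.

    `criterion` mirrors the paper's low/medium/high selection criteria.
    """
    scored = sorted(dataset, key=perplexity_fn)  # ascending perplexity
    k = int(len(scored) * selection_rate)
    if criterion == "low":
        return scored[:k]       # keep the k lowest-perplexity samples
    if criterion == "high":
        return scored[-k:]      # keep the k highest-perplexity samples
    mid = len(scored) // 2      # "medium": k samples around the median
    return scored[mid - k // 2 : mid - k // 2 + k]

def random_split(dataset, ref_fraction, seed=0):
    """Corresponds to `Dref, Dtrain = random_split(D, R)` in Algorithm 1.

    The paper does not give R, so `ref_fraction` here is an assumption.
    """
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * ref_fraction)
    return data[:cut], data[cut:]
```

Sorting the whole split by reference-model perplexity makes all three selection criteria a simple slice of the same ordering, which is why the criterion and the selection rate can be swept independently.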
Hardware Specification | Yes | Training is conducted using llm-foundry (MosaicML, 2023b) on both NVIDIA A100s and H100s.
Software Dependencies | No | All models are trained using the decoupled Lion optimizer (Chen et al., 2024) with a cosine learning rate schedule. Training is conducted using llm-foundry (MosaicML, 2023b). We tokenize all datasets using the GPT-4 tokenizer (OpenAI, 2022). While software components such as the Lion optimizer, llm-foundry, and the GPT-4 tokenizer are mentioned, no version numbers are provided for these dependencies.
Experiment Setup | Yes | All reference models have 125 million parameters, and we consider final models with 1 billion and 3 billion parameters. All reference models are trained for a fixed duration of 26 billion tokens. Unless otherwise specified, all final models are trained to Chinchilla optimal (Hoffmann et al., 2022), meaning that each final model's training duration in tokens is 20 times its parameter count. All models are trained using the decoupled Lion optimizer (Chen et al., 2024) with a cosine learning rate schedule. All reference models and 1B-parameter models are trained with a maximum learning rate and weight decay of 2e-4, and all 3B models are trained with a maximum learning rate and weight decay of 1.6e-4. We sweep across pruning selection criteria and selection rates (Section 7) and find that the best settings are to select high-perplexity samples at a 50% rate for the Pile and medium-perplexity samples at a 50% rate for Dolma.
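The training budgets in this setup follow directly from the Chinchilla-optimal rule of 20 tokens per parameter, and the 50% selection rate implies the pre-pruning split must hold at least twice the final training budget. The helper below is a hedged illustration of that arithmetic; the function names are ours, not from the paper or llm-foundry.

```python
def chinchilla_tokens(n_params):
    """Chinchilla-optimal training duration: 20 tokens per parameter."""
    return 20 * n_params

# Final-model budgets implied by the setup above.
tokens_1b = chinchilla_tokens(1_000_000_000)  # 20 billion tokens
tokens_3b = chinchilla_tokens(3_000_000_000)  # 60 billion tokens

def required_corpus_tokens(final_tokens, selection_rate=0.5):
    """Tokens the reference model must score to yield `final_tokens`
    after pruning at `selection_rate` (e.g. 2x the budget at 50%)."""
    return final_tokens / selection_rate
```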