Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Authors: Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max M Marion, Matthew Leavitt, Mansheej Paul
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a 1.45× reduction in pretraining steps to reach commensurate baseline performance. We evaluate models on 33 different downstream question-answering tasks using the Mosaic ML evaluation gauntlet (Mosaic ML, 2023a). |
| Researcher Affiliation | Collaboration | Zachary Ankner (1,2), Cody Blakeney (1), Kartik Sreenivasan (1), Max Marion (1), Matthew L. Leavitt (3), Mansheej Paul (1); 1: Databricks, 2: MIT, 3: DatologyAI |
| Pseudocode | Yes | Algorithm 1: Pseudocode for performing perplexity-based data pruning. |
| Open Source Code | No | The paper mentions and cites 'llm-foundry' (Mosaic ML, 2023b) as a tool used for training, providing a GitHub link in the references. However, it does not provide an explicit statement of releasing the authors' specific implementation code for the methodology described in this paper, nor a direct link to a repository for their specific work. |
| Open Datasets | Yes | We consider two datasets in this work. The Pile (Gao et al., 2020) is composed of 22 different domains that range from general web scrapes to legal text. Dolma (Soldaini et al., 2024) is composed of 7 different domains and is derived mainly from general web scrapes. |
| Dataset Splits | No | First, we partition the original dataset into two splits: one for training the reference model and one for training the final model. We then prune the final model’s dataset split to a fraction of its original size, referred to as the selection rate (rs), by selecting samples according to a selection criterion, which can be one of low, medium, or high. Algorithm 1 specifies `Dref, Dtrain = random_split(D, R)`, where R is the 'reference training split size'. However, the specific value or proportion of R is not quantitatively provided in the text for their main pruning methodology. |
| Hardware Specification | Yes | Training is conducted using llm-foundry (Mosaic ML, 2023b) and using both Nvidia A100s and H100s. |
| Software Dependencies | No | All models are trained using the decoupled Lion optimizer (Chen et al., 2024) with a cosine learning rate schedule. Training is conducted using llm-foundry (Mosaic ML, 2023b). We tokenize all datasets using the GPT-4 tokenizer (OpenAI, 2022). While software components like the Lion optimizer, llm-foundry, and the GPT-4 tokenizer are mentioned, specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | All reference models have 125 million parameters, and we consider final models with 1 billion and 3 billion parameters. All reference models are trained for a fixed duration of 26 billion tokens. Unless otherwise specified, all final models are trained to Chinchilla optimal (Hoffmann et al., 2022), meaning that each final model’s training duration in tokens is 20 times its parameter count. All models are trained using the decoupled Lion optimizer (Chen et al., 2024) with a cosine learning rate schedule. All reference models and 1B parameter models are trained with a maximum learning rate and weight decay of 2e-4 and all 3B models are trained with a maximum learning rate and weight decay of 1.6e-4. We sweep across pruning selection criteria and selection rates (Section 7) and find that the best settings are to select high-perplexity samples at a 50% rate for the Pile and to select medium-perplexity samples at a 50% rate for Dolma. |
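The pruning procedure summarized in the Dataset Splits and Experiment Setup rows (the paper's Algorithm 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the windowing used for the "medium" criterion, and the abstract `perplexity` callable (in the paper, negative log-likelihood under the trained reference model, exponentiated) are all assumptions for illustration.

```python
import random


def random_split(dataset, ref_fraction):
    """Partition the corpus into a reference-model split and a final-model split.

    ref_fraction stands in for R, the 'reference training split size';
    the paper does not quantify R, so any value here is illustrative.
    """
    shuffled = dataset[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * ref_fraction)
    return shuffled[:cut], shuffled[cut:]


def prune_by_perplexity(samples, perplexity, selection_rate=0.5, criterion="high"):
    """Keep a selection_rate fraction of samples by reference-model perplexity.

    'low'/'high' keep the lowest-/highest-perplexity samples; 'medium'
    keeps a window of samples around the median perplexity (the exact
    windowing is an assumption, not specified in this excerpt).
    """
    scored = sorted(samples, key=perplexity)
    k = int(len(scored) * selection_rate)
    if criterion == "low":
        return scored[:k]
    if criterion == "high":
        return scored[-k:]
    if criterion == "medium":
        mid = len(scored) // 2
        lo = max(0, mid - k // 2)
        return scored[lo:lo + k]
    raise ValueError(f"unknown criterion: {criterion}")
```

With the paper's best-found settings, this would be called with `selection_rate=0.5, criterion="high"` for the Pile and `selection_rate=0.5, criterion="medium"` for Dolma, where `perplexity` scores each sample with the 125M-parameter reference model.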
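The Chinchilla-optimal training duration quoted in the Experiment Setup row (20 training tokens per model parameter, following Hoffmann et al., 2022) reduces to a one-line calculation; a trivial sketch:

```python
def chinchilla_tokens(n_params: int) -> int:
    """Chinchilla-optimal training duration: 20 tokens per parameter."""
    return 20 * n_params


# e.g. a 3B-parameter final model trains on 20 * 3e9 = 60e9 tokens
print(chinchilla_tokens(3_000_000_000))
```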