Improving Pretraining Data Using Perplexity Correlations

Authors: Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier. We have now also updated this paper to include results from preregistered experiments with new pretraining data on an aggregation of 22 benchmarks up to the 1.4B scale, showing increasing improvements of our method over others with more scale.
Researcher Affiliation | Academia | Tristan Thrush, Christopher Potts & Tatsunori Hashimoto, Department of Computer Science, Stanford University, Stanford, CA 94305, USA. EMAIL
Pseudocode | Yes | We show this process explicitly with pseudocode in Algorithm 1 (see Appendix A).
Open Source Code | Yes | A pip package with full documentation can be found here: https://github.com/TristanThrush/perplexity-correlations
Open Datasets | Yes | For our initial experiments, we collected these values on the sample subset of the RedPajama V2 (RPJv2) dataset (Together Computer, 2023)... (https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)
Dataset Splits | Yes | Figure 4 shows 5-fold leave-out plots for PIQA and LAMBADA_FR, with rank predictions given by ⟨θ̂_proj, Φ(x)⟩. Every point in the plot is a held-out point: we estimated θ five times, holding out a different 20% of the data each time, and plotted the held-out predictions. For every data selection method that we tested, the task was to further select 3.2B tokens for pretraining. For efficiency, we set the sample limit to 5000 examples per benchmark.
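The 5-fold leave-out procedure quoted above can be sketched in a few lines: estimate θ on 80% of the points, predict the held-out 20%, and rotate the folds so every point receives a held-out prediction. This is a minimal illustrative sketch, not the paper's estimator: the least-squares fit, the function name, and the feature matrix `X` are stand-in assumptions.

```python
import numpy as np

def five_fold_heldout_predictions(X, y, n_folds=5, seed=0):
    """Return a held-out prediction for every point in (X, y).

    Each point is predicted exactly once, by a model fit on the
    other (n_folds - 1)/n_folds of the data, mirroring the paper's
    5-fold leave-out plots (here with a least-squares stand-in
    for the theta estimator).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # shuffle once
    folds = np.array_split(order, n_folds)   # five disjoint ~20% folds
    preds = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(order, held_out)
        # Fit theta on the 80% training split.
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Predict only the held-out 20%.
        preds[held_out] = X[held_out] @ theta
    return preds
```

Because every prediction comes from a model that never saw that point, plotting `preds` against `y` gives the honest generalization picture the quoted passage describes.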
Hardware Specification | Yes | We trained each LLM on 4 NVIDIA A100 GPUs.
Software Dependencies | No | We trained each LLM on 4 NVIDIA A100 GPUs. At 3.2B tokens, each training run took under 3 hours with the Hugging Face Trainer (Wolf et al., 2019) and appropriate PyTorch (Ansel et al., 2024) compile flags.
Experiment Setup | Yes | We provide pretraining hyperparameters in Table 2.

Table 2: Pretraining hyperparameters
Parameter | Value
Per-device Batch Size | 128
Learning Rate | 5×10⁻³
Warmup Ratio | 0.1
Adam β1 | 0.9
Adam β2 | 0.95
Adam ϵ | 1×10⁻⁸
Weight Decay | 0.1
LR Scheduler | cosine
Max Grad Norm | 1.0
Distributed Backend | nccl
Gradient Accumulation Steps | 1
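Since the paper trained with the Hugging Face Trainer, the Table 2 values map naturally onto `TrainingArguments` keywords. The sketch below expresses that mapping; the keyword names follow the `transformers` API, but the paper only reports the values, so treat the field-to-value pairing as an assumption.

```python
# Table 2 hyperparameters, written as the keyword arguments one would
# pass to transformers.TrainingArguments (an assumed mapping; the paper
# lists only the values themselves).
training_kwargs = dict(
    per_device_train_batch_size=128,
    learning_rate=5e-3,
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    ddp_backend="nccl",
    gradient_accumulation_steps=1,
)

# Usage (requires the transformers package):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **training_kwargs)
```

Keeping the configuration as a plain dict makes it easy to log alongside the run or diff against other selection-method baselines.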