Improving Pretraining Data Using Perplexity Correlations

Authors: Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier. We have now also updated this paper to include results from preregistered experiments with new pretraining data on an aggregation of 22 benchmarks up to the 1.4B scale, showing increasing improvements of our method over others with more scale.
Researcher Affiliation | Academia | Tristan Thrush, Christopher Potts & Tatsunori Hashimoto, Department of Computer Science, Stanford University, Stanford, CA 94305, USA. EMAIL
Pseudocode | Yes | We show this process explicitly with pseudocode in Algorithm 1 (see Appendix A).
Open Source Code | Yes | A pip package with full documentation can be found here: https://github.com/TristanThrush/perplexity-correlations
Open Datasets | Yes | For our initial experiments, we collected these values on the sample subset of the RedPajama V2 (RPJv2) dataset (Together Computer, 2023)... (https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)
Dataset Splits | Yes | Figure 4 shows 5-fold leave-out plots for PIQA and LAMBADA_FR, with rank predictions given by ⟨θ̂_proj, Φ(x)⟩. Every point in the plot is a held-out point: we estimated θ five times, holding out a different 20% of the data each time, and plotted the held-out predictions. For every data selection method that we tested, the task was to further select 3.2B tokens for pretraining. For efficiency, we set the sample limit to 5000 examples per benchmark.
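The 5-fold leave-out procedure quoted above can be sketched in a few lines: estimate θ on 80% of the points, predict the held-out 20%, and rotate the folds so every point receives a held-out prediction. This is a minimal illustrative sketch, not the paper's estimator: the least-squares fit, the function name, and the feature matrix `X` are stand-in assumptions.

```python
import numpy as np

def five_fold_heldout_predictions(X, y, n_folds=5, seed=0):
    """Return a held-out prediction for every point in (X, y).

    Each point is predicted exactly once, by a model fit on the
    other (n_folds - 1)/n_folds of the data, mirroring the paper's
    5-fold leave-out plots (here with a least-squares stand-in
    for the theta estimator).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # shuffle once
    folds = np.array_split(order, n_folds)   # five disjoint ~20% folds
    preds = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(order, held_out)
        # Fit theta on the 80% training split.
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Predict only the held-out 20%.
        preds[held_out] = X[held_out] @ theta
    return preds
```

Because every prediction comes from a model that never saw that point, plotting `preds` against `y` gives the honest generalization picture the quoted passage describes.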
Hardware Specification | Yes | We trained each LLM on 4 NVIDIA A100 GPUs.
Software Dependencies | No | We trained each LLM on 4 NVIDIA A100 GPUs. At 3.2B tokens, each training run took under 3 hours with the Hugging Face Trainer (Wolf et al., 2019) and appropriate PyTorch (Ansel et al., 2024) compile flags.
Experiment Setup | Yes | We provide pretraining hyperparameters in Table 2.

Table 2: Pretraining hyperparameters
Parameter | Value
Per-device Batch Size | 128
Learning Rate | 5×10⁻³
Warmup Ratio | 0.1
Adam β1 | 0.9
Adam β2 | 0.95
Adam ϵ | 1×10⁻⁸
Weight Decay | 0.1
LR Scheduler | cosine
Max Grad Norm | 1.0
Distributed Backend | nccl
Gradient Accumulation Steps | 1
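Since the paper trained with the Hugging Face Trainer, the Table 2 values map naturally onto `TrainingArguments` keywords. The sketch below expresses that mapping; the keyword names follow the `transformers` API, but the paper only reports the values, so treat the field-to-value pairing as an assumption.

```python
# Table 2 hyperparameters, written as the keyword arguments one would
# pass to transformers.TrainingArguments (an assumed mapping; the paper
# lists only the values themselves).
training_kwargs = dict(
    per_device_train_batch_size=128,
    learning_rate=5e-3,
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0,
    ddp_backend="nccl",
    gradient_accumulation_steps=1,
)

# Usage (requires the transformers package):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **training_kwargs)
```

Keeping the configuration as a plain dict makes it easy to log alongside the run or diff against other selection-method baselines.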