DataDecide: How to Predict Best Pretraining Data with Small Experiments
Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). |
| Researcher Affiliation | Collaboration | ¹Allen Institute for AI; ²Paul G. Allen School of Computer Science & Engineering, University of Washington; ³University of Pennsylvania. |
| Pseudocode | No | The paper describes mathematical formulas (Equations 1 and 2) and variations of scaling law methods in prose. It does not contain any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper states: 'To empower open exploration of this question, we release models, data, and evaluations in DATADECIDE'. It also states: 'We will release all models, checkpoints, pretraining corpora, and evaluations.' and 'We share the benefit of this cost through releasing all of our models, data, and evaluations'. While models and data are released, there is no explicit statement or link confirming the release of the source code for the methodology or implementation of their prediction methods. |
| Open Datasets | Yes | To empower open exploration of this question, we release models, data, and evaluations in DATADECIDE the most extensive open suite of models over differences in data and scale. We will release all models, checkpoints, pretraining corpora, and evaluations. Table 1: We release all pretraining corpora, as well as models trained on each recipe and each of the 14 model configurations in Table 2 with 3 random seeds. |
| Dataset Splits | Yes | We use the OLMES suite of 10 multiple choice question answering benchmarks (Gu et al., 2024)... The underlying metric for each task is accuracy... We diverge from OLMES only in that we make use of all available items in the specified split of each benchmark rather than subsampling them, to reduce variance over the task distribution. |
| Hardware Specification | Yes | The pretraining experiments in our DATADECIDE required approximately 820K H100 GPU hours. |
| Software Dependencies | No | The paper mentions using OLMo’s model ladder and heuristics from Porian et al. (2024) for configurations, but it does not specify version numbers for any software, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We select a token to parameter ratio of 100... All 1B (target size) models have 3 full reruns with different seeds... The model ladder uses heuristics from the literature (Porian et al., 2024) to set global batch size and learning rate based on scaling factors. The hyperparameters that determine parameter count (layers, hidden dimension, number of heads, MLP dimension) were handpicked by OLMo developers for each scale... Appendix Table 2 details the configurations of all our models. |
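The headline result above, that small-scale rankings predict ~80% of best-model comparisons at the target scale, is a pairwise decision-accuracy measure. A minimal sketch of how such a metric could be computed is below; the recipe names and scores are illustrative placeholders, and `decision_accuracy` is an assumed formalization of "% of comparisons correct", not code from the paper.

```python
from itertools import combinations

def decision_accuracy(small_scores, target_scores):
    """Fraction of recipe pairs whose ordering by benchmark score at the
    small scale matches their ordering at the target (e.g., 1B) scale."""
    agree = total = 0
    for a, b in combinations(small_scores, 2):
        # Orderings agree when the score differences have the same sign.
        if (small_scores[a] - small_scores[b]) * (target_scores[a] - target_scores[b]) > 0:
            agree += 1
        total += 1
    return agree / total

# Hypothetical accuracies for three data recipes at two model scales.
small = {"C4": 0.42, "Dolma": 0.45, "FineWeb": 0.44}
target = {"C4": 0.55, "Dolma": 0.60, "FineWeb": 0.57}
print(decision_accuracy(small, target))  # 1.0: all three pairwise orderings agree
```

Averaging this quantity over many recipe pairs and benchmarks yields the kind of ~80% figure the paper reports for ranking at a single 150M-parameter scale.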