DataDecide: How to Predict Best Pretraining Data with Small Experiments

Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering, up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting the best models at our larger target scale (1B) (~80% of comparisons correct).
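The small-scale ranking baseline described above can be sketched as a pairwise decision-accuracy computation: for every pair of data recipes, check whether the model ranked higher at the small scale is also ranked higher at the target scale. The function name and toy scores below are hypothetical, assuming one benchmark score per recipe per scale:

```python
from itertools import combinations

def decision_accuracy(small_scores, large_scores):
    """Fraction of recipe pairs whose ranking at a small model scale
    agrees with the ranking at the larger target scale.

    `small_scores` / `large_scores` map recipe name -> benchmark score
    (e.g., OLMES accuracy) at the small and target scales respectively.
    Pairs tied at either scale are skipped (no decision either way).
    """
    recipes = sorted(set(small_scores) & set(large_scores))
    agree = total = 0
    for a, b in combinations(recipes, 2):
        d_small = small_scores[a] - small_scores[b]
        d_large = large_scores[a] - large_scores[b]
        if d_small == 0 or d_large == 0:
            continue  # tie: skip this comparison
        total += 1
        agree += (d_small > 0) == (d_large > 0)
    return agree / total if total else float("nan")

# Toy example with made-up scores for three recipes:
small = {"dolma": 0.52, "c4": 0.48, "fineweb": 0.55}
large = {"dolma": 0.61, "c4": 0.58, "fineweb": 0.63}
print(decision_accuracy(small, large))  # 1.0: all 3 pairs agree
```

With 25 recipes this yields 300 pairwise comparisons; the paper's ~80% figure is the fraction of such pairs decided correctly by the 150M-scale ranking.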
Researcher Affiliation | Collaboration | ¹Allen Institute for AI; ²Paul G. Allen School of Computer Science & Engineering, University of Washington; ³University of Pennsylvania.
Pseudocode | No | The paper describes mathematical formulas (Equations 1 and 2) and variations of scaling law methods in prose. It does not contain any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper states: 'To empower open exploration of this question, we release models, data, and evaluations in DATADECIDE'. It also states: 'We will release all models, checkpoints, pretraining corpora, and evaluations.' and 'We share the benefit of this cost through releasing all of our models, data, and evaluations'. While models and data are released, there is no explicit statement or link confirming the release of the source code for the methodology or implementation of their prediction methods.
Open Datasets | Yes | To empower open exploration of this question, we release models, data, and evaluations in DATADECIDE, the most extensive open suite of models over differences in data and scale. We will release all models, checkpoints, pretraining corpora, and evaluations. Table 1: We release all pretraining corpora, as well as models trained on each recipe and each of the 14 model configurations in Table 2 with 3 random seeds.
Dataset Splits | Yes | We use the OLMES suite of 10 multiple choice question answering benchmarks (Gu et al., 2024)... The underlying metric for each task is accuracy... We diverge from OLMES only in that we make use of all available items in the specified split of each benchmark rather than subsampling them, to reduce variance over the task distribution.
Hardware Specification | Yes | The pretraining experiments in our DATADECIDE required approximately 820K H100 GPU hours.
Software Dependencies | No | The paper mentions using OLMo's model ladder and heuristics from Porian et al. (2024) for configurations, but it does not specify version numbers for any software, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | We select a token to parameter ratio of 100... All 1B (target size) models have 3 full reruns with different seeds... The model ladder uses heuristics from the literature (Porian et al., 2024) to set global batch size and learning rate based on scaling factors. The hyperparameters that determine parameter count (layers, hidden dimension, number of heads, MLP dimension) were handpicked by OLMo developers for each scale... Appendix Table 2 details the configurations of all our models.
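The fixed token-to-parameter ratio of 100 quoted above determines each model's training-token budget directly from its parameter count; at the 1B-parameter target scale this gives the 100B-token runs reported in the paper. A minimal sketch (the function name is hypothetical; the ratio is from the paper):

```python
def token_budget(n_params, tokens_per_param=100):
    """Training-token budget at the paper's fixed token-to-parameter
    ratio of 100 (a Chinchilla-style compute allocation rule)."""
    return n_params * tokens_per_param

# 150M-parameter proxy model -> 15B training tokens
print(token_budget(150_000_000))    # 15000000000
# 1B-parameter target scale -> 100B tokens, matching "up to 100B tokens"
print(token_budget(1_000_000_000))  # 100000000000
```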