DataDecide: How to Predict Best Pretraining Data with Small Experiments
Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). |
| Researcher Affiliation | Collaboration | ¹Allen Institute for AI; ²Paul G. Allen School of Computer Science & Engineering, University of Washington; ³University of Pennsylvania. |
| Pseudocode | No | The paper describes mathematical formulas (Equations 1 and 2) and variations of scaling law methods in prose. It does not contain any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper states: 'To empower open exploration of this question, we release models, data, and evaluations in DATADECIDE'. It also states: 'We will release all models, checkpoints, pretraining corpora, and evaluations.' and 'We share the benefit of this cost through releasing all of our models, data, and evaluations'. While models and data are released, there is no explicit statement or link confirming the release of the source code for the methodology or implementation of their prediction methods. |
| Open Datasets | Yes | To empower open exploration of this question, we release models, data, and evaluations in DATADECIDE the most extensive open suite of models over differences in data and scale. We will release all models, checkpoints, pretraining corpora, and evaluations. Table 1: We release all pretraining corpora, as well as models trained on each recipe and each of the 14 model configurations in Table 2 with 3 random seeds. |
| Dataset Splits | Yes | We use the OLMES suite of 10 multiple choice question answering benchmarks (Gu et al., 2024)... The underlying metric for each task is accuracy... We diverge from OLMES only in that we make use of all available items in the specified split of each benchmark rather than subsampling them, to reduce variance over the task distribution. |
| Hardware Specification | Yes | The pretraining experiments in our DATADECIDE required approximately 820K H100 GPU hours. |
| Software Dependencies | No | The paper mentions using OLMo’s model ladder and heuristics from Porian et al. (2024) for configurations, but it does not specify version numbers for any software, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We select a token to parameter ratio of 100... All 1B (target size) models have 3 full reruns with different seeds... The model ladder uses heuristics from the literature (Porian et al., 2024) to set global batch size and learning rate based on scaling factors. The hyperparameters that determine parameter count (layers, hidden dimension, number of heads, MLP dimension) were handpicked by OLMo developers for each scale... Appendix Table 2 details the configurations of all our models. |
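The headline result above, that small-scale rankings predict ~80% of best-model comparisons at the target scale, is a pairwise decision-accuracy measure. A minimal sketch of how such a metric could be computed is below; the recipe names and scores are illustrative placeholders, and `decision_accuracy` is an assumed formalization of "% of comparisons correct", not code from the paper.

```python
from itertools import combinations

def decision_accuracy(small_scores, target_scores):
    """Fraction of recipe pairs whose ordering by benchmark score at the
    small scale matches their ordering at the target (e.g., 1B) scale."""
    agree = total = 0
    for a, b in combinations(small_scores, 2):
        # Orderings agree when the score differences have the same sign.
        if (small_scores[a] - small_scores[b]) * (target_scores[a] - target_scores[b]) > 0:
            agree += 1
        total += 1
    return agree / total

# Hypothetical accuracies for three data recipes at two model scales.
small = {"C4": 0.42, "Dolma": 0.45, "FineWeb": 0.44}
target = {"C4": 0.55, "Dolma": 0.60, "FineWeb": 0.57}
print(decision_accuracy(small, target))  # 1.0: all three pairwise orderings agree
```

Averaging this quantity over many recipe pairs and benchmarks yields the kind of ~80% figure the paper reports for ranking at a single 150M-parameter scale.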