Scaling Data-Constrained Language Models
Authors: Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. ... Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations. |
| Researcher Affiliation | Collaboration | Niklas Muennighoff (Hugging Face); Alexander M. Rush (Hugging Face); Boaz Barak (Harvard University); Sampo Pyysalo (University of Turku) |
| Pseudocode | No | The paper includes mathematical derivations and formulas but no clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations. ... Appendix L. Release of Artifacts: We open-source all of our models and code under Apache 2.0 licenses. Our filtered data sets are released with the same licenses as the data sets they stem from. All material can be found at: https://github.com/huggingface/datablations. |
| Open Datasets | Yes | Models are trained on subsets of C4 (Raffel et al., 2020). The data constraints are carefully defined to ensure maximal overlap as shown in Figure 2. ... To ensure our findings are not data set-dependent, we train models with the same configurations from Figure 9 on the OSCAR corpus (Ortiz Suárez et al., 2020). |
| Dataset Splits | Yes | Appendix D. Evaluation Details D.1 Loss Evaluation: For all models trained on C4, the final test loss is computed on the same 210 million tokens from the C4 validation set after training. For held-out evaluation during training, such as in Figure 9, the configurations are displayed in Table 2. |
| Hardware Specification | Yes | Appendix I. Hyperparameters and Setup: Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs distributed across up to 64 nodes on the LUMI supercomputer located in Finland. |
| Software Dependencies | No | Appendix I. Hyperparameters and Setup: We have forked the Megatron-DeepSpeed (Rasley et al., 2020; Smith et al., 2022) framework and adapted it for ROCm to enable training on AMD GPUs. While these software components are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | Section 4. Experimental Setup: Following (Hoffmann et al., 2022) we use cosine learning rate schedules that decay 10× over the course of training for each model (different schedules led to different estimates in (Kaplan et al., 2020)). ... Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs... |
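The "decay 10×" schedule quoted above means the learning rate follows a cosine curve from its peak down to one tenth of the peak by the end of training. A minimal sketch of such a schedule is below; the function name, the linear warmup, and the `warmup_steps` parameter are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_ratio=0.1, warmup_steps=0):
    """Cosine learning-rate schedule decaying from peak_lr to
    final_ratio * peak_lr over total_steps (final_ratio=0.1 gives
    the 10x decay described in the setup). Warmup shape is assumed."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup (assumption)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    min_lr = final_ratio * peak_lr
    # Standard cosine interpolation between peak_lr and min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with a peak rate of 1.0 the schedule starts at 1.0, passes through 0.55 at the halfway point, and ends at 0.1.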