Language models scale reliably with over-training and on downstream tasks

Authors: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Luca Soldaini, Jenia Jitsev, Alex Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. The 104 models are trained from scratch on three different datasets: C4 (Raffel et al., 2019; Dodge et al., 2021), RedPajama (Together Computer, 2023), and RefinedWeb (Penedo et al., 2023).
Researcher Affiliation | Collaboration | 1Columbia University, 2Toyota Research Institute, 3UT Austin, 4Apple, 5University of Washington, 6Juelich Supercomputing Center, Research Center Juelich, 7LAION, 8Allen Institute for AI, 9UC Berkeley, 10Bespoke Labs, 11Stanford University, 12Tel Aviv University, 13TU Munich, 14Contextual AI
Pseudocode | No | The paper describes the methodology using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | To facilitate further research on reliable scaling, we provide all results of our experiments. Our experiments are available at https://github.com/mlfoundations/scaling.
Open Datasets | Yes | To establish empirically that scaling extrapolates in the over-trained regime, we further experiment with a testbed of 104 models, trained from scratch on three different datasets: C4 (Raffel et al., 2019; Dodge et al., 2021), RedPajama (Together Computer, 2023), and RefinedWeb (Penedo et al., 2023).
Dataset Splits | No | The paper mentions training on C4, RedPajama, and RefinedWeb, and using 'C4 eval' and 'Open LM eval' as validation sets. However, it does not provide specific percentages, sample counts, or references to predefined splits that would enable reproducing the data partitioning.
Hardware Specification | Yes | We invest 100 A100 hours to train the models required to fit a scaling law for loss and 1,000 A100 hours for a corresponding law for downstream error.
Software Dependencies | No | The paper mentions software such as PyTorch, FlashAttention, xFormers, GPT-NeoX, and SciPy, but it does not specify version numbers for these components.
Experiment Setup | Yes | We train transformers (Vaswani et al., 2017) for next-token prediction, based on architectures like GPT-2 (Radford et al., 2019) and LLaMA (Touvron et al., 2023a). We employ GPT-NeoX (Black et al., 2022) as a standardized tokenizer for all data. We fix the learning rate (3e-3) for our sweeps. We train on 20 tokens per parameter (M = 20), which, in early experiments, gives models near the compute-optimal frontier.
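The paper's second contribution, relating a model's loss (perplexity) to its downstream task performance, amounts to fitting a saturating curve from small-scale runs and extrapolating it to larger models. The sketch below illustrates that workflow with `scipy.optimize.curve_fit` on synthetic data; the functional form `err(L) = eps - k * exp(-gamma * L)` and all data values are illustrative assumptions, not the paper's exact parameterization or results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed saturating form mapping validation loss L to top-1 downstream
# error: err(L) = eps - k * exp(-gamma * L). Error shrinks as loss improves.
def err_from_loss(L, eps, k, gamma):
    return eps - k * np.exp(-gamma * L)

# Synthetic (loss, error) pairs standing in for the small-scale model runs
# one would use to fit such a law (not real measurements from the paper).
rng = np.random.default_rng(0)
losses = np.linspace(2.5, 4.5, 12)
errors = err_from_loss(losses, 0.9, 4.0, 1.0) + rng.normal(0.0, 0.005, losses.shape)

# Fit the three free parameters, then extrapolate to a lower (better) loss
# than any seen during fitting -- the over-trained / larger-model regime.
popt, _ = curve_fit(err_from_loss, losses, errors, p0=[1.0, 1.0, 1.0])
predicted_error = err_from_loss(2.0, *popt)
```

Because the curve is monotone in the loss, a fit of this shape lets cheap runs at higher loss anchor a prediction of downstream error for a model whose loss is known (or itself predicted by a loss scaling law) but which is too expensive to evaluate directly.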