Language models scale reliably with over-training and on downstream tasks

Authors: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Luca Soldaini, Jenia Jitsev, Alex Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. The 104 models are trained from scratch on three different datasets: C4 (Raffel et al., 2019; Dodge et al., 2021), RedPajama (Together Computer, 2023), and RefinedWeb (Penedo et al., 2023).
Researcher Affiliation | Collaboration | 1Columbia University, 2Toyota Research Institute, 3UT Austin, 4Apple, 5University of Washington, 6Juelich Supercomputing Center, Research Center Juelich, 7LAION, 8Allen Institute for AI, 9UC Berkeley, 10Bespoke Labs, 11Stanford University, 12Tel Aviv University, 13TU Munich, 14Contextual AI
Pseudocode | No | The paper describes the methodology using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | To facilitate further research on reliable scaling, we provide all results of our experiments. Our experiments are available at https://github.com/mlfoundations/scaling.
Open Datasets | Yes | To establish empirically that scaling extrapolates in the over-trained regime, we further experiment with a testbed of 104 models, trained from scratch on three different datasets: C4 (Raffel et al., 2019; Dodge et al., 2021), RedPajama (Together Computer, 2023), and RefinedWeb (Penedo et al., 2023).
Dataset Splits | No | The paper mentions training on C4, RedPajama, and RefinedWeb, and using 'C4 eval' and 'Open LM eval' as validation sets. However, it does not provide specific percentages, sample counts, or references to predefined splits that would enable reproducing the data partitioning.
Hardware Specification | Yes | We invest 100 A100 hours to train the models required to fit a scaling law for loss and 1,000 A100 hours for a corresponding law for downstream error.
Software Dependencies | No | The paper mentions software such as PyTorch, FlashAttention, xFormers, GPT-NeoX, and SciPy, but it does not specify version numbers for these components.
Experiment Setup | Yes | We train transformers (Vaswani et al., 2017) for next-token prediction, based on architectures like GPT-2 (Radford et al., 2019) and LLaMA (Touvron et al., 2023a). We employ GPT-NeoX (Black et al., 2022) as a standardized tokenizer for all data. We fix the learning rate (3e-3) for our sweeps. We train on 20 tokens per parameter (M = 20), which, in early experiments, gives models near the compute-optimal frontier.
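The paper's second contribution, relating a model's loss (perplexity) to its downstream task performance, amounts to fitting a saturating curve from small-scale runs and extrapolating it to larger models. The sketch below illustrates that workflow with `scipy.optimize.curve_fit` on synthetic data; the functional form `err(L) = eps - k * exp(-gamma * L)` and all data values are illustrative assumptions, not the paper's exact parameterization or results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed saturating form mapping validation loss L to top-1 downstream
# error: err(L) = eps - k * exp(-gamma * L). Error shrinks as loss improves.
def err_from_loss(L, eps, k, gamma):
    return eps - k * np.exp(-gamma * L)

# Synthetic (loss, error) pairs standing in for the small-scale model runs
# one would use to fit such a law (not real measurements from the paper).
rng = np.random.default_rng(0)
losses = np.linspace(2.5, 4.5, 12)
errors = err_from_loss(losses, 0.9, 4.0, 1.0) + rng.normal(0.0, 0.005, losses.shape)

# Fit the three free parameters, then extrapolate to a lower (better) loss
# than any seen during fitting -- the over-trained / larger-model regime.
popt, _ = curve_fit(err_from_loss, losses, errors, p0=[1.0, 1.0, 1.0])
predicted_error = err_from_loss(2.0, *popt)
```

Because the curve is monotone in the loss, a fit of this shape lets cheap runs at higher loss anchor a prediction of downstream error for a model whose loss is known (or itself predicted by a loss scaling law) but which is too expensive to evaluate directly.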