Language models scale reliably with over-training and on downstream tasks
Authors: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Luca Soldaini, Jenia Jitsev, Alex Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. |
| Researcher Affiliation | Collaboration | 1Columbia University, 2Toyota Research Institute, 3UT Austin, 4Apple, 5University of Washington, 6Juelich Supercomputing Center, Research Center Juelich, 7LAION, 8Allen Institute for AI, 9UC Berkeley, 10Bespoke Labs, 11Stanford University, 12Tel Aviv University, 13TU Munich, 14Contextual AI |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To facilitate further research on reliable scaling, we provide all results of our experiments. Our experiments are available at https://github.com/mlfoundations/scaling. |
| Open Datasets | Yes | To establish empirically that scaling extrapolates in the over-trained regime, we further experiment with a testbed of 104 models, trained from scratch on three different datasets: C4 (Raffel et al., 2019; Dodge et al., 2021), RedPajama (Together Computer, 2023), and RefinedWeb (Penedo et al., 2023). |
| Dataset Splits | No | The paper mentions training on C4, RedPajama, and RefinedWeb, and using 'C4 eval' and 'OpenLM eval' as validation sets. However, it does not explicitly provide specific percentages, sample counts, or references to predefined splits for these datasets to enable reproducibility of the data partitioning. |
| Hardware Specification | Yes | We invest 100 A100 hours to train the models required to fit a scaling law for loss and 1,000 A100 hours for a corresponding law for downstream error. |
| Software Dependencies | No | The paper mentions the use of software such as PyTorch, FlashAttention, xFormers, GPT-NeoX, and SciPy, but it does not specify any version numbers for these software components. |
| Experiment Setup | Yes | We train transformers (Vaswani et al., 2017) for next token prediction, based on architectures like GPT-2 (Radford et al., 2019) and LLaMA (Touvron et al., 2023a). We employ GPT-NeoX (Black et al., 2022) as a standardized tokenizer for all data. We fix the learning rate (3e-3) for our sweeps. We train on 20 tokens per parameter (M = 20), which, in early experiments, gives models near the compute-optimal frontier. |
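The two fitting steps the quoted evidence describes — a scaling law that predicts loss from compute, and a relation mapping loss to downstream task error — can be sketched as below. This is a minimal illustration on synthetic data: the constants `a_true`, `alpha_true`, `eps`, `k`, and `gamma` are hypothetical placeholders, not the paper's fitted coefficients, and the simple power-law and exponential-decay forms are assumed shapes, not the paper's exact parameterization.

```python
import numpy as np

# Synthetic (compute, loss) pairs following L(C) = a * C^(-alpha).
# a_true and alpha_true are made-up constants for illustration only.
a_true, alpha_true = 4.0, 0.05
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = a_true * compute ** (-alpha_true)

# Step 1: fit the power law by ordinary least squares in log-log space,
# since log L = log a - alpha * log C is linear in log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha_fit, a_fit = -slope, np.exp(intercept)

# Step 2: map an extrapolated loss to downstream top-1 error with an
# assumed exponential-decay relation Err(L) = eps - k * exp(-gamma * L);
# eps, k, gamma are hypothetical placeholder values.
eps, k, gamma = 0.9, 0.8, 1.5
loss_pred = a_fit * (1e21) ** (-alpha_fit)  # extrapolate to larger compute
err_pred = eps - k * np.exp(-gamma * loss_pred)

print(f"alpha={alpha_fit:.3f}, a={a_fit:.2f}, "
      f"loss_pred={loss_pred:.3f}, err_pred={err_pred:.3f}")
```

With real models, step 1 would be fit over many (compute, loss) measurements per dataset, and the loss-to-error constants in step 2 would themselves be fit from held-out evaluations rather than assumed.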