Scaling Data-Constrained Language Models
Authors: Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. ... Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations. |
| Researcher Affiliation | Collaboration | Niklas Muennighoff (Hugging Face); Alexander M. Rush (Hugging Face); Boaz Barak (Harvard University); Sampo Pyysalo (University of Turku) |
| Pseudocode | No | The paper includes mathematical derivations and formulas but no clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations. ... Appendix L. Release of Artifacts: We open-source all of our models and code under Apache 2.0 licenses. Our filtered data sets are released with the same licenses as the data sets they stem from. All material can be found at: https://github.com/huggingface/datablations. |
| Open Datasets | Yes | Models are trained on subsets of C4 (Raffel et al., 2020). The data constraints are carefully defined to ensure maximal overlap as shown in Figure 2. ... To ensure our findings are not data set-dependent, we train models with the same configurations from Figure 9 on the OSCAR corpus (Ortiz Suárez et al., 2020). |
| Dataset Splits | Yes | Appendix D. Evaluation Details D.1 Loss Evaluation: For all models trained on C4, the final test loss is computed on the same 210 million tokens from the C4 validation set after training. For held-out evaluation during training, such as in Figure 9, the configurations are displayed in Table 2. |
| Hardware Specification | Yes | Appendix I. Hyperparameters and Setup: Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs distributed across up to 64 nodes on the LUMI supercomputer located in Finland. |
| Software Dependencies | No | Appendix I. Hyperparameters and Setup: We have forked the Megatron-DeepSpeed (Rasley et al., 2020; Smith et al., 2022) framework and adapted it for ROCm to enable training on AMD GPUs. While these software components are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | Section 4. Experimental Setup: Following (Hoffmann et al., 2022) we use cosine learning rate schedules that decay 10× over the course of training for each model (different schedules led to different estimates in (Kaplan et al., 2020)). ... Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs... |
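The "decay 10×" schedule quoted above means the learning rate follows a cosine curve from its peak down to one tenth of the peak by the end of training. A minimal sketch of such a schedule is below; the function name, the linear warmup, and the `warmup_steps` parameter are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_ratio=0.1, warmup_steps=0):
    """Cosine learning-rate schedule decaying from peak_lr to
    final_ratio * peak_lr over total_steps (final_ratio=0.1 gives
    the 10x decay described in the setup). Warmup shape is assumed."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup (assumption)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    min_lr = final_ratio * peak_lr
    # Standard cosine interpolation between peak_lr and min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with a peak rate of 1.0 the schedule starts at 1.0, passes through 0.55 at the halfway point, and ends at 0.1.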