Loss-to-Loss Prediction: Scaling Laws for All Datasets

Authors: David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham M. Kakade

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws. ... To facilitate our analysis, we pre-train models of varying size with varying FLOP budgets on 6 pre-training datasets... The total grid contains 528 models, or 88 models per training dataset.
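The shifted power law relation quoted above (loss on dataset B as a shifted power law in loss on dataset A) can be illustrated numerically. The sketch below uses synthetic parameters, not the paper's fitted values, and simplifies by assuming the shift terms are known, so the fit reduces to a log-log linear regression:

```python
import numpy as np

# Shifted power law form from the paper: L_B ≈ K * (L_A - E_A)^kappa + E_B,
# where E_A, E_B play the role of irreducible-loss shifts.
# All parameter values here are illustrative, not the paper's fits.
E_A, E_B, K, kappa = 1.8, 1.5, 0.9, 1.2

L_A = np.linspace(2.2, 4.0, 20)            # losses on dataset A across compute
L_B = K * (L_A - E_A) ** kappa + E_B       # paired losses on dataset B

# With the shifts assumed known, the relation is linear in log space:
# log(L_B - E_B) = log K + kappa * log(L_A - E_A)
x = np.log(L_A - E_A)
y = np.log(L_B - E_B)
kappa_hat, logK_hat = np.polyfit(x, y, 1)  # slope, intercept

print(round(kappa_hat, 3), round(np.exp(logK_hat), 3))  # → 1.2 0.9
```

In practice the shifts are fitted jointly with K and kappa (e.g. by nonlinear least squares); the log-linear shortcut only works once the shifts are pinned down.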
Researcher Affiliation | Academia | David Brandfonbrener (EMAIL), Kempner Institute, Harvard University; Nikhil Anand (EMAIL), Kempner Institute, Harvard University; Nikhil Vyas (EMAIL), SEAS, Harvard University; Eran Malach (EMAIL), Kempner Institute, Harvard University; Sham Kakade (EMAIL), Kempner Institute and SEAS, Harvard University
Pseudocode | No | The paper contains mathematical equations, figures, and descriptions of methods, but no explicitly labeled 'Pseudocode' or 'Algorithm' sections, nor structured code-like blocks describing a procedure.
Open Source Code | Yes | Notebooks: https://github.com/KempnerInstitute/loss-to-loss-notebooks; Training code: https://github.com/KempnerInstitute/loss-to-loss-olmo; Models: https://huggingface.co/KempnerInstituteAI/loss-to-loss
Open Datasets | Yes | To facilitate our analysis, we pre-train models of varying size with varying FLOP budgets on 6 pre-training datasets: FineWeb (Penedo et al., 2024), FineWeb-edu (Penedo et al., 2024), ProofPile 2 (Azerbayev et al., 2023; Computer, 2023; Paster et al., 2023), SlimPajama (Soboleva et al., 2023), SmolLM Corpus (Ben Allal et al., 2024), and Starcoder v1 (Li et al., 2023b).
Dataset Splits | No | The paper describes using 6 pre-training datasets and evaluating on 11 downstream tasks, but it does not provide specific percentages, absolute counts, or detailed methodologies for how the training, validation, or test data splits were created for these datasets. It mentions evaluating 'zero-shot on the downstream tasks' and 'on the OOD test set' but lacks detailed split information.
Hardware Specification | No | The paper mentions 'FLOP budgets for our sweep range from 2e17 to 4.84e19' and 'we train 6 larger models (one for each dataset) at a FLOP budget of 1e21', but it does not specify any particular GPU models, CPU types, or other hardware components used to achieve these FLOP budgets.
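For a sense of scale, the FLOP budgets quoted above can be translated into rough token counts with the common C ≈ 6·N·D accounting for transformer training. This is an assumption for illustration; the excerpt does not state which FLOP accounting the paper uses:

```python
# Rough tokens implied by a FLOP budget under the standard C ≈ 6 * N * D
# approximation (N = parameters, D = tokens). Illustrative only.
def tokens_for_budget(flops, n_params):
    return flops / (6 * n_params)

# e.g. the sweep's smallest budget (2e17 FLOPs) with a hypothetical
# 20M-parameter model:
print(f"{tokens_for_budget(2e17, 20e6):.2e}")  # → 1.67e+09
```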
Software Dependencies | No | The paper mentions using the 'OLMo (Groeneveld et al., 2024) codebase', 'PyTorch Layernorm', and the 'Llama2' tokenizer, along with specific optimizer parameters like 'Adam', 'β1 0.9', 'β2 0.95', 'ϵ 1e-15'. However, it does not provide version numbers for PyTorch, the OLMo codebase, or the Llama2 tokenizer, which are crucial for reproducibility.
Experiment Setup | Yes | We train all models using OLMo (Groeneveld et al., 2024) and generally follow hyperparameter settings from Wortsman et al. (2023); Zhao et al. (2024). Importantly, we use a linear warmup and cosine decay schedule for every run and only report the final performance... Full hyperparameters and details can be found in Appendix D. ...
Table 7: Model parameters (Groeneveld et al., 2024; Wortsman et al., 2023; Zhao et al., 2024)
- n: 6-24 for small models, 40 for the 3.3B model
- Number of heads: n
- Head dimension: 64
- MLP hidden multiplier: 4
- Depth: n
- Context length: 512
- Activation: GeLU
- Positional encoding: RoPE
- Biases: False
- Normalization: PyTorch Layernorm
- QK normalization: True
- Precision: Mixed, bfloat16
- Tokenizer: Llama2
Table 8: Training parameters (Groeneveld et al., 2024; Wortsman et al., 2023; Zhao et al., 2024)
- Optimizer: Adam
- Batch size: 1024
- Learning rate: 1e-3
- Schedule: Linear warmup, cosine decay
- Warmup steps: 20% of total steps
- z-loss coefficient: 1e-4
- Weight decay: 0.0
- β1: 0.9
- β2: 0.95
- ϵ: 1e-15
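The training parameters in Table 8 can be collected into a single config object for anyone re-running the sweep. The key names below are our own shorthand, not OLMo's config schema; the values are exactly those reported:

```python
# Training hyperparameters from Table 8 as a plain dict. Key names are
# invented for this sketch; the paper itself trains via the OLMo codebase.
train_config = {
    "optimizer": "Adam",
    "batch_size": 1024,
    "learning_rate": 1e-3,
    "schedule": "linear warmup, cosine decay",
    "warmup_frac": 0.20,          # warmup steps = 20% of total steps
    "z_loss_coefficient": 1e-4,
    "weight_decay": 0.0,
    "beta1": 0.9,
    "beta2": 0.95,
    "eps": 1e-15,
}

# e.g. warmup steps for a run with 10,000 total optimizer steps:
print(int(train_config["warmup_frac"] * 10_000))  # → 2000
```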