A Hitchhiker’s Guide to Scaling Law Estimation

Authors: Leshem Choshen, Yang Zhang, Jacob Andreas

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1,000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that, all else being equal, estimates of performance are generally most accurate when derived from other models of similar sizes.
Researcher Affiliation | Collaboration | 1MIT, 2MIT-IBM Watson AI Lab, 3IBM Research. Correspondence to: Leshem Choshen <EMAIL>.
Pseudocode | No | The paper describes methods and procedures in prose, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | See our repository for code, data, and experimental results.
Open Datasets | Yes | As part of this work, we have collected and released the largest-scale public dataset describing scaling behavior across model families.
Dataset Splits | Yes | To evaluate estimated scaling laws reliably, we need to account for loss fluctuations during large-scale model training. Thus, we test against a few checkpoints near the end of training: we choose as target models F_target the models from the set F_{#tok>30%} defined in the previous paragraph; that is, we take F_target = F_{P, #tok>30%}.
Hardware Specification | No | The paper mentions 'computational cost (in FLOPs)' but does not specify any particular hardware (GPU, CPU, or specific cloud instances) used for the experiments.
Software Dependencies | No | Estimation of scaling law parameters uses the curve_fit function in scikit-learn (Pedregosa et al., 2011), with square loss.
Experiment Setup | Yes | All experiments in this paper use the widely used functional form proposed by Hoffmann et al. (2022): L̂(f) := e^E + e^A · #params(f)^(−α) + e^B · #toks(f)^(−β). ... Estimation of scaling law parameters uses the curve_fit function in scikit-learn (Pedregosa et al., 2011), with square loss.
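The functional form quoted above can be fit directly by nonlinear least squares. A minimal sketch follows, using SciPy's scipy.optimize.curve_fit (note that curve_fit is provided by SciPy, although the paper attributes it to scikit-learn); all parameter values and the synthetic data grid are illustrative assumptions, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, B, alpha, beta):
    # Hoffmann et al. (2022) form with exponentiated coefficients for positivity:
    #   L(f) = e^E + e^A * #params(f)^(-alpha) + e^B * #toks(f)^(-beta)
    n_params, n_toks = x
    return np.exp(E) + np.exp(A) * n_params ** (-alpha) + np.exp(B) * n_toks ** (-beta)

# Illustrative synthetic grid of (model size, token count) pairs -- not real data.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
toks = np.array([1e10, 3e10, 1e11, 3e11])
n_params, n_toks = (a.ravel() for a in np.meshgrid(sizes, toks))

# Generate losses from assumed "true" parameters plus a little noise.
true = dict(E=np.log(1.7), A=np.log(400.0), B=np.log(4000.0), alpha=0.34, beta=0.28)
rng = np.random.default_rng(0)
loss = scaling_law((n_params, n_toks), **true) + rng.normal(0.0, 0.002, n_params.size)

# curve_fit minimizes square loss (ordinary least squares) by default.
popt, _ = curve_fit(scaling_law, (n_params, n_toks), loss,
                    p0=[0.0, 5.0, 8.0, 0.3, 0.3], maxfev=50000)
E, A, B, alpha, beta = popt
print(f"e^E={np.exp(E):.3f}  alpha={alpha:.3f}  beta={beta:.3f}")
```

With checkpoints from several model sizes and token counts, the fitted α and β characterize how quickly loss falls with parameters and data, and e^E estimates the irreducible loss.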