A Hitchhiker’s Guide to Scaling Law Estimation

Authors: Leshem Choshen, Yang Zhang, Jacob Andreas

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1,000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that, all else being equal, estimates of performance are generally most accurate when derived from other models of similar sizes.
Researcher Affiliation | Collaboration | 1MIT, 2MIT-IBM Watson AI Lab, 3IBM Research. Correspondence to: Leshem Choshen <EMAIL>.
Pseudocode | No | The paper describes methods and procedures in prose, but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | See our repository for code, data, and experimental results.
Open Datasets | Yes | As part of this work, we have collected and released the largest-scale public dataset describing scaling behavior across model families.
Dataset Splits | Yes | To evaluate estimated scaling laws reliably, we need to account for loss fluctuations during large-scale model training. Thus, we test against a few checkpoints near the end of training: we choose as target models F_target the models from the set F_{#tok>30%} defined in the previous paragraph; that is, we take F_target = F_{P, #tok>30%}.
Hardware Specification | No | The paper mentions 'computational cost (in FLOPs)' but does not specify any particular hardware (GPU, CPU, or specific cloud instances) used for the experiments.
Software Dependencies | No | Estimation of scaling law parameters uses the curve_fit function in scikit-learn (Pedregosa et al., 2011), with square loss.
Experiment Setup | Yes | All experiments in this paper use the widely used functional form proposed by Hoffmann et al. (2022): L̂(f) := e^E + e^A · #params(f)^(−α) + e^B · #toks(f)^(−β). ... Estimation of scaling law parameters uses the curve_fit function in scikit-learn (Pedregosa et al., 2011), with square loss.
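The functional form quoted above can be fit directly by nonlinear least squares. A minimal sketch follows, using SciPy's scipy.optimize.curve_fit (note that curve_fit is provided by SciPy, although the paper attributes it to scikit-learn); all parameter values and the synthetic data grid are illustrative assumptions, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, B, alpha, beta):
    # Hoffmann et al. (2022) form with exponentiated coefficients for positivity:
    #   L(f) = e^E + e^A * #params(f)^(-alpha) + e^B * #toks(f)^(-beta)
    n_params, n_toks = x
    return np.exp(E) + np.exp(A) * n_params ** (-alpha) + np.exp(B) * n_toks ** (-beta)

# Illustrative synthetic grid of (model size, token count) pairs -- not real data.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
toks = np.array([1e10, 3e10, 1e11, 3e11])
n_params, n_toks = (a.ravel() for a in np.meshgrid(sizes, toks))

# Generate losses from assumed "true" parameters plus a little noise.
true = dict(E=np.log(1.7), A=np.log(400.0), B=np.log(4000.0), alpha=0.34, beta=0.28)
rng = np.random.default_rng(0)
loss = scaling_law((n_params, n_toks), **true) + rng.normal(0.0, 0.002, n_params.size)

# curve_fit minimizes square loss (ordinary least squares) by default.
popt, _ = curve_fit(scaling_law, (n_params, n_toks), loss,
                    p0=[0.0, 5.0, 8.0, 0.3, 0.3], maxfev=50000)
E, A, B, alpha, beta = popt
print(f"e^E={np.exp(E):.3f}  alpha={alpha:.3f}  beta={beta:.3f}")
```

With checkpoints from several model sizes and token counts, the fitted α and β characterize how quickly loss falls with parameters and data, and e^E estimates the irreducible loss.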