Scaling Optimal LR Across Token Horizons
Authors: Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a large-scale empirical study on how optimal learning rate (LR) depends on the token horizon in LLM training. Our study is essentially a large ablation experiment where we vary LR and token horizon for a few different LLM models. We consider >250 training runs in total. |
| Researcher Affiliation | Industry | Microsoft, Nvidia, Meta |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper. |
| Open Source Code | No | The paper mentions that 'Experiments are run on the Megatron codebase (Shoeybi et al., 2019).' This refers to a third-party tool used, not an explicit statement or link for the authors' own open-source code for the methodology described in this paper. |
| Open Datasets | Yes | We use the Refined Web dataset (Penedo et al., 2023), a common-crawl derived dataset of roughly 600B tokens which is known to be of high quality (Penedo et al., 2024). |
| Dataset Splits | No | The paper describes experiments varying 'token horizon' and mentions 'final validation loss,' but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or specific splitting methodologies) for reproducibility. |
| Hardware Specification | No | The paper mentions 'the recently operational Memphis super-cluster contains over 100,000 H100 GPUs' in the introduction as a general example for LLM training, but it does not specify the particular hardware (e.g., GPU models, CPU models, or specific cluster configurations) used for the experiments conducted in this study. |
| Software Dependencies | No | The paper mentions using 'the Megatron codebase (Shoeybi et al., 2019)' and 'Numpy and Scipy (Harris et al., 2020)' for curve fitting, but it does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | We use hyperparameters following GPT-3: weight decay of 0.1, gradient clipping of 1.0, and a cosine learning-rate decay schedule. The full list of hyperparameters can be viewed in Table 3 in Appendix A. Table 3 lists: 'weight decay 0.1', 'grad clip norm 1.0', 'LR schedule cosine', 'Adam β1 0.9', 'Adam β2 0.95', 'Context length 2048', 'Batch size (tokens) 524288', 'Warmup Steps max(1000, 0.01 train iters)', 'Min LR 0.1', 'Max LR'. |
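The Software Dependencies row notes that the paper uses Numpy and Scipy for curve fitting of the optimal LR as a function of token horizon. A minimal sketch of such a fit is below; the power-law functional form `eta_opt(T) = a * T^b` and all numeric values are illustrative assumptions, not figures from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (token horizon, optimal LR) pairs -- NOT data from the paper.
horizons = np.array([5e9, 2e10, 5e10, 1e11, 4e11])   # training tokens
opt_lrs = np.array([6.0e-4, 4.2e-4, 3.3e-4, 2.8e-4, 2.0e-4])

def power_law(T, a, b):
    # Assumed form: eta_opt(T) = a * T^b, with b expected to be negative
    # (longer horizons favor smaller learning rates).
    return a * np.power(T, b)

# Fit the two parameters; p0 gives a rough starting point for the optimizer.
(a_fit, b_fit), _ = curve_fit(power_law, horizons, opt_lrs, p0=(1.0, -0.3))
```

With a fit like this, an optimal LR tuned at a short horizon can be extrapolated to a longer one via `power_law(T_long, a_fit, b_fit)`, which is the kind of transfer the paper studies.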