Scaling Optimal LR Across Token Horizons

Authors: Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a large-scale empirical study on how optimal learning rate (LR) depends on the token horizon in LLM training. Our study is essentially a large ablation experiment where we vary LR and token horizon for a few different LLM models. We consider >250 training runs in total.
Researcher Affiliation | Industry | Microsoft, Nvidia, Meta
Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper.
Open Source Code | No | The paper states that 'Experiments are run on the Megatron codebase (Shoeybi et al., 2019).' This refers to a third-party tool, not an explicit statement of, or link to, the authors' own open-source code for the methodology described in this paper.
Open Datasets | Yes | We use the Refined Web dataset (Penedo et al., 2023), a common-crawl-derived dataset of roughly 600B tokens which is known to be of high quality (Penedo et al., 2024).
Dataset Splits | No | The paper describes experiments varying 'token horizon' and mentions 'final validation loss,' but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or a specific splitting methodology) for reproducibility.
Hardware Specification | No | The paper mentions 'the recently operational Memphis super-cluster contains over 100,000 H100 GPUs' in the introduction as a general example of LLM training scale, but it does not specify the particular hardware (e.g., GPU models, CPU models, or cluster configuration) used for the experiments conducted in this study.
Software Dependencies | No | The paper mentions using 'the Megatron codebase (Shoeybi et al., 2019)' and 'Numpy and Scipy (Harris et al., 2020)' for curve fitting, but it does not provide specific version numbers for these or any other software components.
Experiment Setup | Yes | We use hyperparameters following GPT-3: weight decay of 0.1, gradient clipping of 1.0, and a cosine learning rate decay schedule. The full list of hyperparameters can be viewed in Table 3 in Appendix A. Table 3 lists: 'weight decay 0.1', 'grad clip norm 1.0', 'LR schedule cosine', 'Adam β1 0.9', 'Adam β2 0.95', 'Context length 2048', 'Batch size (tokens) 524288', 'Warmup Steps max(1000, 0.01 train iters)', 'Min LR 0.1', 'Max LR'.
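The Table 3 hyperparameters quoted above can be sketched as a configuration, including the warmup rule and the relationship between token horizon and optimizer steps. This is a minimal illustration, not the authors' code (which is not released); the function names and the example 50B-token horizon are assumptions for illustration only.

```python
# Hedged sketch of the GPT-3-style hyperparameters quoted from Table 3.
# Key names are illustrative; values are the ones stated in the audit above.
CONFIG = {
    "weight_decay": 0.1,
    "grad_clip_norm": 1.0,
    "lr_schedule": "cosine",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "context_length": 2048,
    "batch_size_tokens": 524_288,  # 2**19 tokens per optimizer step
}

def train_iters(token_horizon: int,
                batch_size_tokens: int = CONFIG["batch_size_tokens"]) -> int:
    """Optimizer steps needed to consume a given token horizon."""
    return token_horizon // batch_size_tokens

def warmup_steps(iters: int) -> int:
    """Table 3 rule: 'Warmup Steps max(1000, 0.01 train iters)'."""
    return max(1000, int(0.01 * iters))

# Example with a hypothetical 50B-token horizon (not a value reported
# in the paper): compute the step count and the resulting warmup.
iters = train_iters(50_000_000_000)
print(iters, warmup_steps(iters))
```

Note that for shorter horizons the 1000-step floor dominates the warmup rule, while for horizons beyond roughly 100,000 steps the 1% term takes over.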