Scaling Optimal LR Across Token Horizons
Authors: Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a large-scale empirical study on how optimal learning rate (LR) depends on the token horizon in LLM training. Our study is essentially a large ablation experiment where we vary LR and token horizon for a few different LLM models. We consider >250 training runs in total. |
| Researcher Affiliation | Industry | Microsoft, Nvidia, Meta |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or formatted as such in the paper. |
| Open Source Code | No | The paper mentions that 'Experiments are run on the Megatron codebase (Shoeybi et al., 2019).' This refers to a third-party tool used, not an explicit statement or link for the authors' own open-source code for the methodology described in this paper. |
| Open Datasets | Yes | We use the Refined Web dataset (Penedo et al., 2023), a common-crawl derived dataset of roughly 600B tokens which is known to be of high quality (Penedo et al., 2024). |
| Dataset Splits | No | The paper describes experiments varying 'token horizon' and mentions 'final validation loss,' but it does not specify explicit training, validation, or test dataset splits (e.g., percentages, sample counts, or specific splitting methodologies) for reproducibility. |
| Hardware Specification | No | The paper mentions 'the recently operational Memphis super-cluster contains over 100,000 H100 GPUs' in the introduction as a general example for LLM training, but it does not specify the particular hardware (e.g., GPU models, CPU models, or specific cluster configurations) used for the experiments conducted in this study. |
| Software Dependencies | No | The paper mentions using 'the Megatron codebase (Shoeybi et al., 2019)' and 'Numpy and Scipy (Harris et al., 2020)' for curve fitting, but it does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | We use hyperparameters following GPT-3: weight decay of 0.1, gradient clipping of 1.0, and a cosine learning-rate decay schedule. The full list of hyperparameters can be viewed in Table 3 in Appendix A. Table 3 lists: 'weight decay 0.1', 'grad clip norm 1.0', 'LR schedule cosine', 'Adam β1 0.9', 'Adam β2 0.95', 'Context length 2048', 'Batch size (tokens) 524288', 'Warmup Steps max(1000, 0.01 train iters)', 'Min LR 0.1', 'Max LR'. |
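The Software Dependencies row notes that the paper uses Numpy and Scipy for curve fitting of the optimal LR as a function of token horizon. A minimal sketch of such a fit is below; the power-law functional form `eta_opt(T) = a * T^b` and all numeric values are illustrative assumptions, not figures from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (token horizon, optimal LR) pairs -- NOT data from the paper.
horizons = np.array([5e9, 2e10, 5e10, 1e11, 4e11])   # training tokens
opt_lrs = np.array([6.0e-4, 4.2e-4, 3.3e-4, 2.8e-4, 2.0e-4])

def power_law(T, a, b):
    # Assumed form: eta_opt(T) = a * T^b, with b expected to be negative
    # (longer horizons favor smaller learning rates).
    return a * np.power(T, b)

# Fit the two parameters; p0 gives a rough starting point for the optimizer.
(a_fit, b_fit), _ = curve_fit(power_law, horizons, opt_lrs, p0=(1.0, -0.3))
```

With a fit like this, an optimal LR tuned at a short horizon can be extrapolated to a longer one via `power_law(T_long, a_fit, b_fit)`, which is the kind of transfer the paper studies.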