Deconstructing What Makes a Good Optimizer for Autoregressive Language Models

Authors: Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, Lion, and Sophia, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. We train decoder-only language models on C4 tokenized with the T5 tokenizer (Raffel et al., 2020) and report results in terms of validation loss. Our findings indicate that, except for SGD, these algorithms all perform comparably, both in their optimal performance and in how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification.
Researcher Affiliation Academia Rosie Zhao Harvard University Depen Morwani Harvard University David Brandfonbrener Kempner Institute at Harvard University Nikhil Vyas Harvard University Sham Kakade Kempner Institute at Harvard University
Pseudocode Yes Algorithm 1: Adalayer. Parameters: learning rate η, exponential decay rates for the moment estimates β1, β2, number of steps T, ε. while t ≤ T do: for each layer l with p parameters do: g_t^l ← ∇_l L(w_t); v_t^l ← β2 · v_{t−1}^l + (1 − β2) · p^{−1/2} · ‖g_t^l‖_2^2; m_t^l ← β1 · m_{t−1}^l + (1 − β1) · g_t^l; w_{t+1}^l ← w_t^l − η · m_t^l / (√(v_t^l) + ε)
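One reading of Algorithm 1 can be sketched in NumPy. This is an illustrative interpretation, not the authors' implementation: the p^{−1/2} scaling and the Adam-style denominator √v + ε are reconstructed from the garbled pseudocode, and `adalayer_step` is a hypothetical function name.

```python
import numpy as np

def adalayer_step(layers, grads, m, v, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-15):
    """One Adalayer update (sketch): Adam-like, except the second-moment
    estimate v is a single scalar per layer, computed from a norm of that
    layer's gradient, rather than a per-parameter quantity."""
    new_layers, new_m, new_v = [], [], []
    for w, g, m_l, v_l in zip(layers, grads, m, v):
        p = g.size  # number of parameters in this layer
        # scalar second moment per layer; p**-0.5 scaling assumed from Algorithm 1
        v_l = beta2 * v_l + (1 - beta2) * p ** -0.5 * np.sum(g ** 2)
        # first moment is still tracked per parameter, as in Adam
        m_l = beta1 * m_l + (1 - beta1) * g
        # Adam-style update, but every parameter in the layer shares one v_l
        w = w - lr * m_l / (np.sqrt(v_l) + eps)
        new_layers.append(w); new_m.append(m_l); new_v.append(v_l)
    return new_layers, new_m, new_v
```

Because v_l is a scalar, Adalayer stores one second-moment value per layer instead of one per parameter, which is what makes it useful for the paper's ablations of where Adam's per-parameter adaptivity matters.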
Open Source Code No The paper mentions using 'the standard PyTorch implementation of AdamW (Paszke et al., 2019), the timm implementation of SGDW (Wightman, 2019), and the OLMo implementation of Lion (Groeneveld et al., 2024)' and that they 'implement ourselves a modified version of Adafactor', but it does not provide an explicit statement or link for the source code of their own work.
Open Datasets Yes We train decoder-only language models on C4 tokenized with the T5 tokenizer (Raffel et al., 2020)
Dataset Splits No We train decoder-only language models on C4 tokenized with the T5 tokenizer (Raffel et al., 2020) and report results in terms of validation loss. We default to training models for approximately the Chinchilla-optimal number of tokens, i.e., 20 times the number of parameters. Explicitly, this means for the 150m models we train for 25k steps or 3.3b tokens. The 300m models are trained for 50k steps, the 600m models are trained for 100k steps, the 1.2b models are trained for 200k steps, and the 150m-long models are also trained for 100k steps.
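As a sanity check, the quoted step counts are consistent with the batch size of 256 and sequence length of 512 given in the experiment-setup row, since tokens = steps × batch × sequence length:

```python
BATCH, SEQ = 256, 512  # from the experiment-setup row
STEPS = {"150m": 25_000, "300m": 50_000, "600m": 100_000,
         "1.2b": 200_000, "150m-long": 100_000}

for name, steps in STEPS.items():
    tokens = steps * BATCH * SEQ
    print(f"{name}: {tokens / 1e9:.1f}b tokens")
# 25k steps x 256 x 512 = 3,276,800,000, i.e. the quoted ~3.3b tokens
```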
Hardware Specification No The paper mentions 'We train in mixed precision with bfloat16', but does not provide specific details such as GPU models, CPU types, or other hardware specifications used for running the experiments.
Software Dependencies No We use the standard PyTorch implementation of AdamW (Paszke et al., 2019), the timm implementation of SGDW (Wightman, 2019), and the OLMo implementation of Lion (Groeneveld et al., 2024).
Experiment Setup Yes We start from the OLMo codebase (Groeneveld et al., 2024) and train decoder-only transformer models of four sizes: 150m, 300m, 600m, and 1.2b... The models have widths of 1024, 1024, and 1408 and depths of 12, 24, 24. The MLP hidden dimension is 4x of the width. The activation function is GeLU (Hendrycks & Gimpel, 2016). We use RoPE positional encodings (Su et al., 2024). Attention heads are always dimension 64... We train in mixed precision with bfloat16. For all models, we use a batch size of 256 and sequence length of 512... We default to training models for the approximately Chinchilla-optimal number of tokens... We default to using 0 weight decay. We default to using a learning rate schedule with 10% of the training steps for warmup and then cosine decay with a minimum that is 10% of the maximum learning rate. We default to β2 = 0.95 and ϵ = 1e-15 following Wortsman et al. (2024).
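The quoted schedule (10% of steps for linear warmup, then cosine decay to 10% of the peak learning rate) can be sketched as follows; the function name and argument defaults are illustrative, not taken from the paper's code:

```python
import math

def lr_at(step, total_steps, max_lr, warmup_frac=0.1, min_frac=0.1):
    """Learning rate at a given step: linear warmup over the first
    warmup_frac of training, then cosine decay from max_lr down to
    min_frac * max_lr over the remaining steps."""
    warmup = int(warmup_frac * total_steps)
    min_lr = min_frac * max_lr
    if step < warmup:
        return max_lr * (step + 1) / warmup  # linear warmup
    # cosine decay from max_lr to min_lr over the post-warmup steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the 150m setting (25k steps), this would warm up over the first 2,500 steps and end near 10% of the peak learning rate.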