SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling

Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SOAP on language model pre-training, with experiments on 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
Researcher Affiliation | Academia | Nikhil Vyas (Harvard University), Depen Morwani (Harvard University), Rosie Zhao (Harvard University), Itai Shapira (Harvard University), David Brandfonbrener (Kempner Institute at Harvard University), Lucas Janson (Harvard University), Sham Kakade (Kempner Institute at Harvard University)
Pseudocode | Yes | Algorithm 1: Single step of idealized Shampoo with power 1/2. Algorithm 2: Single step of idealized Adafactor in Shampoo's eigenspace. Algorithm 3: Single step of SOAP for an m × n layer. Algorithm 4: Eigenvectors function, implemented using power iteration and QR decomposition.
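The core idea the algorithms above describe is running an Adam-style update in the eigenbasis of Shampoo's preconditioners. The sketch below is an illustrative simplification, not the authors' implementation: eigenvectors are recomputed with a full eigendecomposition every step (rather than amortized with power iteration and QR as in Algorithm 4), and the moment bookkeeping is condensed relative to Algorithm 3. All names and default values here are assumptions for illustration.

```python
import numpy as np

def soap_step(G, state, lr=3e-3, b1=0.95, b2=0.95, eps=1e-8):
    """Simplified sketch of one SOAP update for a 2-D layer gradient G (m x n).

    Maintains Shampoo-style preconditioner statistics L = EMA[G G^T] and
    R = EMA[G^T G], rotates the gradient into their eigenbasis, takes an
    Adam step there, and rotates the resulting update back.
    """
    m, n = G.shape
    if not state:  # lazily initialize optimizer state on first call
        state.update(L=np.zeros((m, m)), R=np.zeros((n, n)),
                     m1=np.zeros((m, n)), v=np.zeros((m, n)), t=0)
    state["t"] += 1
    # Shampoo's left/right preconditioner statistics (EMA of gradient outer products)
    state["L"] = b2 * state["L"] + (1 - b2) * (G @ G.T)
    state["R"] = b2 * state["R"] + (1 - b2) * (G.T @ G)
    # Eigenbases of the preconditioners (recomputed every step for simplicity;
    # the paper amortizes this with power iteration + QR)
    _, QL = np.linalg.eigh(state["L"])
    _, QR = np.linalg.eigh(state["R"])
    Gp = QL.T @ G @ QR  # gradient expressed in Shampoo's eigenbasis
    # Adam moments in the rotated space, with standard bias correction
    state["m1"] = b1 * state["m1"] + (1 - b1) * Gp
    state["v"] = b2 * state["v"] + (1 - b2) * Gp ** 2
    mh = state["m1"] / (1 - b1 ** state["t"])
    vh = state["v"] / (1 - b2 ** state["t"])
    upd = mh / (np.sqrt(vh) + eps)  # Adam update in the rotated space
    return -lr * (QL @ upd @ QR.T)  # rotate the update back to parameter space
```

Because the eigenbasis drifts slowly between steps, the paper can refresh QL and QR only periodically, which is where SOAP's wall-clock savings over Shampoo come from.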
Open Source Code | Yes | An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
Open Datasets | Yes | We train language models on C4 tokenized with the T5 tokenizer (Raffel et al., 2020) and report results in terms of validation loss.
Dataset Splits | No | The paper mentions 'validation loss', 'training data', and 'final test loss', implying splits are used, but it does not specify the exact percentages, sample counts, or methodology for these splits (e.g., '80/10/10 split'). The text 'We run SOAP on .5, .625, .75 and .875 fraction of the training data' refers to the duration of training, not the partitioning of the dataset into distinct sets.
Hardware Specification | Yes | At present, we perform these measurements on a single H100 GPU and utilize gradient accumulation to accommodate large batch sizes.
Software Dependencies | No | The paper mentions 'PyTorch' and specific functions like 'torch.linalg.eigh' and 'torch.linalg.qr'. It also refers to the 'standard PyTorch implementation of AdamW (Paszke et al., 2019)' and the Distributed Shampoo (Shi et al., 2023) implementation of Shampoo. However, it does not provide specific version numbers for PyTorch or any other software libraries used.
Experiment Setup | Yes | Default hyperparameters: We use β1 = 0.95, as we found it to outperform β1 = 0.9 in our sweeps for the 360m model. Following Wortsman et al. (2024) we use decoupled weight decay with coefficient 1e-4 and z-loss with coefficient 1e-4. We use the default value of ϵ = 1e-8 in AdamW (actual or when used for grafting), SOAP and GaLore. We use warmup followed by cosine decay as our scheduler. We start the warmup and end the cosine decay at 0.1x the maximum learning rate. Token counts: For all of our runs we use a sequence length of 1024. For all models (except in Section 6.3), we use a token batch size of 2048k (2m). We default to training models for the approximately Chinchilla-optimal (Hoffmann et al., 2022) number of tokens, that is, 20 times the number of parameters. Explicitly, this means for our default batch size of 2m, the 210m models are trained for 1600 steps or 3.3b tokens. The 360m models are trained for 3200 steps, and the 660m models are trained for 6400 steps.
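The quoted schedule (linear warmup followed by cosine decay, with the warmup starting and the decay ending at 0.1x the maximum learning rate) can be sketched as below. The warmup length is not given in this excerpt, so `warmup_steps` is an assumed parameter, and the function name is illustrative.

```python
import math

def lr_schedule(step, max_lr, warmup_steps, total_steps, floor_frac=0.1):
    """Sketch of the quoted schedule: linear warmup from floor_frac * max_lr
    up to max_lr, then cosine decay back down to floor_frac * max_lr."""
    lo = floor_frac * max_lr
    if step < warmup_steps:
        # linear warmup starting at 0.1x the maximum learning rate
        return lo + (max_lr - lo) * step / warmup_steps
    # cosine decay from max_lr down to 0.1x the maximum learning rate
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lo + (max_lr - lo) * 0.5 * (1 + math.cos(math.pi * frac))
```

For example, with `total_steps=3200` (the quoted 360m-model run) the schedule starts at `0.1 * max_lr`, peaks at `max_lr` when warmup ends, and returns to `0.1 * max_lr` at step 3200.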