SOAP: Improving and Stabilizing Shampoo using Adam for Language Modeling
Authors: Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SOAP on language model pre-training, with experiments on 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP. |
| Researcher Affiliation | Academia | Nikhil Vyas Harvard University Depen Morwani Harvard University Rosie Zhao Harvard University Itai Shapira Harvard University David Brandfonbrener Kempner Institute at Harvard University Lucas Janson Harvard University Sham Kakade Kempner Institute at Harvard University |
| Pseudocode | Yes | Algorithm 1: Single step of idealized Shampoo with power 1/2. Algorithm 2: Single step of idealized Adafactor in Shampoo's eigenspace. Algorithm 3: Single step of SOAP for an m × n layer. Algorithm 4: Eigenvectors function, implemented using power iteration and QR decomposition. |
| Open Source Code | Yes | An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP. |
| Open Datasets | Yes | We train language models on C4 tokenized with the T5 tokenizer (Raffel et al., 2020) and report results in terms of validation loss. |
| Dataset Splits | No | The paper mentions 'validation loss', 'training data', and 'final test loss' implying splits are used, but it does not specify the exact percentages, sample counts, or methodology for these splits (e.g., '80/10/10 split'). The text 'We run SOAP on .5, .625, .75 and .875 fraction of the training data' refers to the duration of training, not the partitioning of the dataset into distinct sets. |
| Hardware Specification | Yes | At present, we perform these measurements on a single H100 GPU and utilize gradient accumulation to accommodate large batch sizes. |
| Software Dependencies | No | The paper mentions 'PyTorch' and specific functions like 'torch.linalg.eigh' and 'torch.linalg.qr'. It also refers to the 'standard PyTorch implementation of AdamW (Paszke et al., 2019)' and the 'Distributed Shampoo (Shi et al., 2023) implementation of Shampoo'. However, it does not provide specific version numbers for PyTorch or any other software libraries used. |
| Experiment Setup | Yes | Default hyperparameters: We use β1 = 0.95, as we found it to outperform β1 = 0.9 in our sweeps for the 360m model. Following Wortsman et al. (2024) we use decoupled weight decay with coefficient 1e-4 and z-loss with coefficient 1e-4. We use the default value of ϵ = 1e-8 in AdamW (actual or when used for grafting), SOAP and GaLore. We use warmup followed by cosine decay as our scheduler. We start the warmup and end the cosine decay at 0.1x the maximum learning rate. Token counts: For all of our runs we use a sequence length of 1024. For all models (except in Section 6.3), we use a token batch size of 2048k (2m). We default to training models for approximately the Chinchilla-optimal (Hoffmann et al., 2022) number of tokens, that is, 20 times the number of parameters. Explicitly, this means for our default batch size of 2m, the 210m models are trained for 1600 steps or 3.3b tokens. The 360m models are trained for 3200 steps, the 660m models are trained for 6400 steps. |
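The pseudocode row above summarizes SOAP's core idea: maintain Shampoo's preconditioner statistics, rotate the gradient into their eigenbasis, and run an Adam-style update there. The following is a minimal NumPy sketch of that idea for a single m × n layer; it is our illustrative reconstruction, not the authors' released code (the paper's Algorithm 3 amortizes the eigenvector computation across steps via power iteration and QR, which this sketch recomputes exactly every step), and all variable names here are our own.

```python
import numpy as np

def soap_step(G, state, lr=3e-3, b1=0.95, b2=0.95, shampoo_b2=0.95, eps=1e-8):
    """One idealized SOAP step for an m x n layer (illustrative sketch).

    Maintains EMAs of G G^T and G^T G (Shampoo's statistics), rotates the
    gradient into their eigenbasis, applies an Adam update in that basis,
    and rotates the resulting update back. Returns the parameter delta.
    """
    m, n = G.shape
    if not state:  # lazy initialization on the first step
        state.update(L=np.zeros((m, m)), R=np.zeros((n, n)),
                     m1=np.zeros((m, n)), v=np.zeros((m, n)), t=0)
    state["t"] += 1
    # Shampoo statistics: EMAs of the gradient outer products.
    state["L"] = shampoo_b2 * state["L"] + (1 - shampoo_b2) * (G @ G.T)
    state["R"] = shampoo_b2 * state["R"] + (1 - shampoo_b2) * (G.T @ G)
    # Eigenbases of the statistics (the paper amortizes this computation).
    _, QL = np.linalg.eigh(state["L"])
    _, QR = np.linalg.eigh(state["R"])
    # Rotate the gradient into the eigenbasis and run Adam there.
    Gp = QL.T @ G @ QR
    state["m1"] = b1 * state["m1"] + (1 - b1) * Gp
    state["v"] = b2 * state["v"] + (1 - b2) * Gp ** 2
    m_hat = state["m1"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    update = m_hat / (np.sqrt(v_hat) + eps)
    # Rotate the update back to the layer's original basis.
    return -lr * (QL @ update @ QR.T)
```

Note that in the full algorithm the Adam moments live in a slowly changing eigenbasis, so the eigenvectors are only refreshed periodically; this sketch ignores that subtlety for clarity.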
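The learning-rate schedule in the setup row (linear warmup followed by cosine decay, with both the warmup start and the decay end pinned at 0.1x the maximum learning rate) can be sketched as follows; the function name and argument names are ours, chosen for illustration.

```python
import math

def lr_schedule(step, max_lr, warmup_steps, total_steps, floor_frac=0.1):
    """Warmup-then-cosine schedule: ramps linearly from floor_frac * max_lr
    up to max_lr, then cosine-decays back down to floor_frac * max_lr."""
    floor = floor_frac * max_lr
    if step < warmup_steps:
        # Linear warmup starting at the floor, not at zero.
        return floor + (max_lr - floor) * step / warmup_steps
    # Cosine decay from max_lr back down to the floor.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (max_lr - floor) * (1 + math.cos(math.pi * progress))
```

For example, with `max_lr=1.0`, `warmup_steps=100`, and `total_steps=1000`, the schedule starts at 0.1, peaks at 1.0 at step 100, and decays back to 0.1 by step 1000.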