SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Authors: Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, SWAN achieves a 50% reduction in total end-to-end memory compared to Adam. Under the memory-efficient LLaMA training benchmark of Zhao et al. (2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models. The paper includes a dedicated section "5. Experiments" with subsections such as "5.1. SWAN Performance on LLM Pre-training Tasks", "5.2. Ablation of SWAN on LLM pretraining", and "5.3. Memory Efficiency and Throughput Analysis", presenting tables and figures (Table 2, Figure 1, Figure 4, Figure 9) with quantitative results and comparisons. |
| Researcher Affiliation | Industry | 1Microsoft Research. Correspondence to: Chao Ma <EMAIL>, Wenbo Gong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: SWAN Optimizer; Algorithm 2: Grad Whitening Operator; Algorithm 3: Grad Norm Operator |
| Open Source Code | No | Code will be released at github.com/microsoft/msr_optim |
| Open Datasets | Yes | We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) |
| Dataset Splits | Yes | We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) using an effective batch size of 130K tokens. Following the setup of (Zhao et al., 2024a) |
| Hardware Specification | Yes | We compare SWAN-0, Adam, and GaLore on a single A100 GPU. We assess throughput when training a 1.3B LLaMA model on 8 A100 GPUs with a batch size of 130K. |
| Software Dependencies | No | Training uses BF16 by default unless specified; see Appendix K. The other evaluation settings follow Zhao et al. (2024a). For all SWAN variants, we use BF16 for model weights and gradients. For the Grad Whitening step of SWAN-0 and SWAN, we use FP32 to whiten the BF16 gradients and then convert them back to BF16. |
| Experiment Setup | Yes | We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) using an effective batch size of 130K tokens. Following the setup of Zhao et al. (2024a), SWAN is applied to all linear modules in both attention and MLP blocks. Training uses BF16 by default unless specified; see Appendix K. SWAN-0 ... uses naive NS-iteration for whitening, disables learning rate warmup, and uses learning rates similar to those optimized for Adam. SWAN ... enables learning rate warmup and allows optimized learning rates that differ largely from Adam's. SWAN ... employs the proposed NSDS scheme for fast whitening (Section 3.2). For Adam we ... use β1 = 0.9, β2 = 0.99, and no weight decay. |
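The two operators named in the pseudocode row above (Grad Norm and Grad Whitening via NS-iteration) can be sketched as a stateless update. This is a minimal illustration, not the paper's implementation: the normalization axis, the NS-iteration count, and the `swan_step` signature are assumptions; the whitening here approximates (GGᵀ)^{-1/2}G with the classic Newton-Schulz iteration.

```python
import numpy as np

def grad_norm(g, eps=1e-8):
    # Normalize each row of the gradient matrix to unit norm.
    # (Sketch: the exact normalization axis is an assumption here.)
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)

def grad_whitening(g, steps=5):
    # Approximate the whitened gradient (G G^T)^{-1/2} G with a
    # Newton-Schulz iteration (the "naive NS-iteration" used by SWAN-0).
    # Scaling by the Frobenius norm keeps singular values in the
    # iteration's convergence region.
    x = g / (np.linalg.norm(g) + 1e-8)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def swan_step(w, g, lr):
    # Stateless update: no momentum or second-moment buffers are kept,
    # which is the source of the ~50% end-to-end memory saving over Adam
    # reported in the table above.
    return w - lr * grad_whitening(grad_norm(g))
```

The point of the sketch is that both operators are pure functions of the current gradient, so the optimizer carries no per-parameter state between steps, unlike Adam's two moment buffers.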