SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

Authors: Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, SWAN achieves a 50% reduction in total end-to-end memory compared to Adam. Under the memory-efficient LLaMA training benchmark of (Zhao et al., 2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models. The paper includes a dedicated section, "5. Experiments", with subsections such as "5.1. SWAN Performance on LLM Pre-training Tasks", "5.2. Ablation of SWAN on LLM pretraining", and "5.3. Memory Efficiency and Throughput Analysis", presenting tables and figures (Table 2, Figure 1, Figure 4, Figure 9) showing quantitative results and comparisons.
Researcher Affiliation Industry 1Microsoft Research. Correspondence to: Chao Ma <EMAIL>, Wenbo Gong <EMAIL>.
Pseudocode Yes Algorithm 1: SWAN Optimizer; Algorithm 2: Grad Whitening Operator; Algorithm 3: Grad Norm Operator
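The Grad Whitening operator (Algorithm 2) applies a whitening transform of the form (GGᵀ)^{-1/2}G to the gradient matrix via Newton-Schulz (NS) iteration. The sketch below is a minimal illustration of that idea, not a reproduction of the paper's Algorithm 2 or its NSDS variant: the coefficients, step count, and Frobenius-norm preconditioning are assumptions.

```python
import numpy as np

def grad_whitening(G, steps=20):
    """Illustrative whitening of a gradient matrix G, approximating
    (G G^T)^{-1/2} G with a cubic Newton-Schulz iteration.
    Sketch only; the paper's Algorithm 2 may differ in details."""
    # Scale so every singular value lies in (0, 1], a standard
    # precondition for Newton-Schulz convergence.
    Y = G / np.linalg.norm(G)
    for _ in range(steps):
        # Cubic NS step: pushes each singular value of Y toward 1,
        # so Y converges to the orthogonal polar factor of G.
        Y = 1.5 * Y - 0.5 * Y @ Y.T @ Y
    return Y

# Example: the whitened gradient has approximately orthonormal rows.
G = np.random.default_rng(0).standard_normal((4, 8))
W = grad_whitening(G)
print(np.allclose(W @ W.T, np.eye(4), atol=1e-3))
```

Because the result depends only on the current gradient, no optimizer state (first/second moments) is carried between steps, which is the source of SWAN's memory savings over Adam.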
Open Source Code No Code will be released at github.com/microsoft/msr_optim
Open Datasets Yes We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020)
Dataset Splits Yes We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) using an effective batch size of 130K tokens. Following the setup of (Zhao et al., 2024a)
Hardware Specification Yes We compare SWAN-0, Adam, and Galore on a single A100 GPU. We assess throughput when training a 1.3B LLama model on 8 A100 GPUs with a batch size of 130K.
Software Dependencies No Training uses BF16 by default unless specified, see Appendix K. The other evaluation settings follow Zhao et al. (2024a). For all SWAN variants, we use BF16 for model weights and gradients. For the Grad Whitening step of SWAN-0 and SWAN we use FP32 to whiten the BF16 gradients and then convert them back to BF16.
Experiment Setup Yes We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) using an effective batch size of 130K tokens. Following the setup of (Zhao et al., 2024a), SWAN is applied to all linear modules in both attention and MLP blocks. Training uses BF16 by default unless specified, see Appendix K. SWAN-0 ... uses naive NS-iteration for whitening, disables learning rate warmup, and uses learning rates similar to those optimized for Adam. SWAN ... enables learning rate warmup and allows the use of optimized learning rates that differ largely from Adam's. SWAN ... employs the proposed NSDS scheme for fast whitening (section 3.2). For Adam we ... use β1 = 0.9, β2 = 0.99, and no weight decay.