SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Authors: Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, SWAN achieves a 50% reduction in total end-to-end memory compared to Adam. Under the memory-efficient LLaMA training benchmark of Zhao et al. (2024a), SWAN reaches the same evaluation perplexity using half as many tokens for the 350M and 1.3B models. The paper includes a dedicated section "5. Experiments" with subsections such as "5.1. SWAN Performance on LLM Pre-training Tasks", "5.2. Ablation of SWAN on LLM pretraining", and "5.3. Memory Efficiency and Throughput Analysis", presenting tables and figures (Table 2, Figure 1, Figure 4, Figure 9) with quantitative results and comparisons. |
| Researcher Affiliation | Industry | 1Microsoft Research. Correspondence to: Chao Ma <EMAIL>, Wenbo Gong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: SWAN Optimizer; Algorithm 2: Grad Whitening Operator; Algorithm 3: Grad Norm Operator |
| Open Source Code | No | Code will be released at github.com/microsoft/msr_optim |
| Open Datasets | Yes | We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) |
| Dataset Splits | Yes | We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) using an effective batch size of 130K tokens. Following the setup of (Zhao et al., 2024a) |
| Hardware Specification | Yes | We compare SWAN-0, Adam, and GaLore on a single A100 GPU. We assess throughput when training a 1.3B LLaMA model on 8 A100 GPUs with a batch size of 130K. |
| Software Dependencies | No | Training uses BF16 by default unless specified; see Appendix K. The other evaluation settings follow Zhao et al. (2024a). For all SWAN variants, we use BF16 for model weights and gradients. For the Grad Whitening step of SWAN-0 and SWAN, we use FP32 to whiten the BF16 gradients and then convert them back to BF16. |
| Experiment Setup | Yes | We consider models with 60M, 130M, 350M, and 1.3B parameters, all trained on the C4 dataset (Raffel et al., 2020) using an effective batch size of 130K tokens. Following the setup of Zhao et al. (2024a), SWAN is applied to all linear modules in both attention and MLP blocks. Training uses BF16 by default unless specified; see Appendix K. SWAN-0 ... uses naive NS-iteration for whitening, disables learning rate warmup, and uses learning rates similar to those optimized for Adam. SWAN ... enables learning rate warmup and allows optimized learning rates that differ largely from Adam's. SWAN ... employs the proposed NSDS scheme for fast whitening (Section 3.2). For Adam we ... use β1 = 0.9, β2 = 0.99, and no weight decay. |
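The two operators named in the pseudocode row above (Grad Norm and Grad Whitening via NS-iteration) can be sketched as a stateless update. This is a minimal illustration, not the paper's implementation: the normalization axis, the NS-iteration count, and the `swan_step` signature are assumptions; the whitening here approximates (GGᵀ)^{-1/2}G with the classic Newton-Schulz iteration.

```python
import numpy as np

def grad_norm(g, eps=1e-8):
    # Normalize each row of the gradient matrix to unit norm.
    # (Sketch: the exact normalization axis is an assumption here.)
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)

def grad_whitening(g, steps=5):
    # Approximate the whitened gradient (G G^T)^{-1/2} G with a
    # Newton-Schulz iteration (the "naive NS-iteration" used by SWAN-0).
    # Scaling by the Frobenius norm keeps singular values in the
    # iteration's convergence region.
    x = g / (np.linalg.norm(g) + 1e-8)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def swan_step(w, g, lr):
    # Stateless update: no momentum or second-moment buffers are kept,
    # which is the source of the ~50% end-to-end memory saving over Adam
    # reported in the table above.
    return w - lr * grad_whitening(grad_norm(g))
```

The point of the sketch is that both operators are pure functions of the current gradient, so the optimizer carries no per-parameter state between steps, unlike Adam's two moment buffers.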