SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across a range of model scales. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Exeter, Exeter, UK 2Department of Computer Science, University of Leicester, Leicester, UK 3Mathematical Institute, University of Oxford, Oxford, UK 4Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, US 5Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, NL |
| Pseudocode | Yes | Pseudocode of SPAM is in Algorithm 1. |
| Open Source Code | Yes | Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git. |
| Open Datasets | Yes | Experiments were conducted on two datasets: the well-known C4 dataset (Raffel et al., 2020) and a cleaner high-quality dataset, SlimPajama (Soboleva et al., 2023). |
| Dataset Splits | Yes | We report the training curves of various LLaMA models on the C4 dataset as well as the final perplexity in Figure 1 and Table 1, respectively. Overall, we observe that SPAM consistently achieves superior performance. ... We evaluate SPAM by specifying d% such that its memory usage, including both parameters and optimizer states, matches that of GaLore. For GaLore, LoRA, and ReLoRA baselines, we set the ranks r = 128, 256, 256, 512 for the 60M, 130M, 350M, and 1B models, respectively, following the setup in GaLore (Zhao et al., 2024). The results in Table 3 show that SPAM consistently outperforms all the baselines by a good margin, demonstrating its effectiveness as a memory-efficient optimizer. ... In this section, we evaluate the effectiveness of SPAM for supervised fine-tuning. Following Li et al. (2024a), we fine-tune LLaMA2-7B on Commonsense170K (Hu et al., 2023) and test on 8 downstream tasks. |
| Hardware Specification | Yes | The runtime is measured by the average of 100 iterations under one H100 GPU. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). It mentions 'BF16 format' but this refers to data precision rather than a specific software version. |
| Experiment Setup | Yes | For SPAM, we set reset interval T = 500, lr warmup step N = 150 and GSS threshold θ = 5000. Detailed descriptions of our task setups and hyperparameters are provided in Appendix D. ... We use a max sequence length of 256 for all models, with a batch size of 512 (131K tokens). For all experiments, we adopt learning rate warmup of 1000 training steps, and use cosine annealing for the learning rate schedule, decaying to 10% of the initial learning rate. ... Table 8: Hyperparameters of SPAM for pre-training experiments in this paper. ... Table 9: Hyperparameters of SPAM for fine-tuning experiments in this paper. |
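To make the reported hyperparameters (reset interval T = 500, post-reset warmup N = 150, GSS threshold θ = 5000) concrete, here is a minimal, hedged sketch of the mechanisms the paper describes: periodic momentum reset, a short learning-rate warmup after each reset, and spike-aware clipping of gradient entries whose squared value exceeds θ times the running second moment. This is an illustrative single-scalar toy, not the authors' implementation (see Algorithm 1 and the released code for the real thing); all names here are assumptions.

```python
import math

def spam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, T=500, N=150, theta=5000.0):
    """Illustrative SPAM-style update for one scalar parameter (a sketch,
    not the paper's Algorithm 1)."""
    # Periodic momentum reset: zero both Adam moments every T steps.
    if step % T == 0:
        m, v = 0.0, 0.0

    # Spike-aware clipping: if g^2 / v exceeds the GSS threshold theta,
    # rescale the gradient to sign(g) * sqrt(theta * v).
    if v > 0 and g * g / v > theta:
        g = math.copysign(math.sqrt(theta * v), g)

    # Standard Adam moment updates with bias correction, counting steps
    # since the last reset.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    t = step % T + 1
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Linear lr warmup for N steps after each reset (the paper's exact
    # warmup schedule may differ; this is an assumption).
    warm = min(1.0, t / N)
    p = p - lr * warm * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

With a constant positive gradient, a few calls in a loop move the parameter downward while the warmup keeps the first post-reset steps small; the clipping branch only fires when a single gradient entry spikes far above its second-moment estimate.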