SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Authors: Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across a range of model scales. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Exeter, Exeter, UK 2Department of Computer Science, University of Leicester, Leicester, UK 3Mathematical Institute, University of Oxford, Oxford, UK 4Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, US 5Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, NL |
| Pseudocode | Yes | Pseudocode of SPAM is in Algorithm 1. |
| Open Source Code | Yes | Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git. |
| Open Datasets | Yes | Experiments were conducted on two datasets: the well-known C4 dataset (Raffel et al., 2020) and a cleaner high-quality dataset, SlimPajama (Soboleva et al., 2023). |
| Dataset Splits | Yes | We report the training curves of various LLaMA models on the C4 dataset as well as the final perplexity in Figure 1 and Table 1, respectively. Overall, we observe that SPAM consistently achieves superior performance. ... We evaluate SPAM by specifying d% such that its memory usage, including both parameters and optimizer states, matches that of GaLore. For GaLore, LoRA, and ReLoRA baselines, we set the ranks r = 128, 256, 256, 512 for the 60M, 130M, 350M, and 1B models, respectively, following the setup in GaLore (Zhao et al., 2024). The results in Table 3 show that SPAM consistently outperforms all the baselines by a good margin, demonstrating its effectiveness as a memory-efficient optimizer. ... In this section, we evaluate the effectiveness of SPAM for supervised fine-tuning. Following Li et al. (2024a), we fine-tune LLaMA2-7B on Commonsense170K (Hu et al., 2023) and test on 8 downstream tasks. |
| Hardware Specification | Yes | The runtime is measured by the average of 100 iterations under one H100 GPU. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). It mentions 'BF16 format' but this refers to data precision rather than a specific software version. |
| Experiment Setup | Yes | For SPAM, we set reset interval T = 500, lr warmup step N = 150 and GSS threshold θ = 5000. Detailed descriptions of our task setups and hyperparameters are provided in Appendix D. ... We use a max sequence length of 256 for all models, with a batch size of 512 (131K tokens). For all experiments, we adopt learning rate warmup of 1000 training steps, and use cosine annealing for the learning rate schedule, decaying to 10% of the initial learning rate. ... Table 8: Hyperparameters of SPAM for pre-training experiments in this paper. ... Table 9: Hyperparameters of SPAM for fine-tuning experiments in this paper. |
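To make the reported hyperparameters (reset interval T = 500, post-reset warmup N = 150, GSS threshold θ = 5000) concrete, here is a minimal, hedged sketch of the mechanisms the paper describes: periodic momentum reset, a short learning-rate warmup after each reset, and spike-aware clipping of gradient entries whose squared value exceeds θ times the running second moment. This is an illustrative single-scalar toy, not the authors' implementation (see Algorithm 1 and the released code for the real thing); all names here are assumptions.

```python
import math

def spam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, T=500, N=150, theta=5000.0):
    """Illustrative SPAM-style update for one scalar parameter (a sketch,
    not the paper's Algorithm 1)."""
    # Periodic momentum reset: zero both Adam moments every T steps.
    if step % T == 0:
        m, v = 0.0, 0.0

    # Spike-aware clipping: if g^2 / v exceeds the GSS threshold theta,
    # rescale the gradient to sign(g) * sqrt(theta * v).
    if v > 0 and g * g / v > theta:
        g = math.copysign(math.sqrt(theta * v), g)

    # Standard Adam moment updates with bias correction, counting steps
    # since the last reset.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    t = step % T + 1
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Linear lr warmup for N steps after each reset (the paper's exact
    # warmup schedule may differ; this is an assumption).
    warm = min(1.0, t / N)
    p = p - lr * warm * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

With a constant positive gradient, a few calls in a loop move the parameter downward while the warmup keeps the first post-reset steps small; the clipping branch only fires when a single gradient entry spikes far above its second-moment estimate.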