On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Authors: Xianliang Li, Jun Luo, Zhiwei Zheng, Hanxiao Wang, Li Luo, Lingkun Wen, Linlong Wu, Sheng Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers. (Page 1) ... In this section, we present an empirical study to discover the influence of the momentum coefficients by comparing the test performance on momentum systems with different dynamic magnitude responses. We train VGG (Simonyan & Zisserman, 2014) on the CIFAR-10 (Krizhevsky et al., 2009) dataset and ResNet50 (He et al., 2016) on the CIFAR-100 dataset using different momentum coefficients, while keeping all other hyperparameters unchanged. For each experiment, we report the mean and standard error (as subscripts) of test accuracy for 3 runs with random seeds from 0-2. The detailed experimental settings can be found in Appendix D. The experimental results on CIFAR-10 show high similarity to those on CIFAR-100. Thus, here, we mainly focus on the analysis based on CIFAR-100 and defer the experimental results of VGG16 on CIFAR-10 to Appendix C.3. (Page 4) |
| Researcher Affiliation | Academia | Xianliang Li 1,2, Jun Luo 1,2, Zhiwei Zheng 3, Hanxiao Wang 2,4, Li Luo 5, Lingkun Wen 2,6, Linlong Wu 7, Sheng Xu 1. 1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 University of California, Berkeley; 4 Institute of Automation, Chinese Academy of Sciences; 5 Sun Yat-sen University; 6 Shanghai Astronomical Observatory, Chinese Academy of Sciences; 7 University of Luxembourg. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: FSGDM. Input: Σ, c, v, N; Initialization: m_0, µ = cΣ, δ = Σ/N; for each t = 1, 2, …: g_t = ∇L_t(x_{t−1}, ζ_{t−1}); u(t) = t/(t + µ); u_t = u(⌈t/δ⌉·δ); m_t = u_t·m_{t−1} + v·g_t; x_t = x_{t−1} − α_t·m_t; end |
| Open Source Code | Yes | Our implementation of FSGDM is available at https://github.com/yinleung/FSGDM. |
| Open Datasets | Yes | We train VGG (Simonyan & Zisserman, 2014) on the CIFAR-10 (Krizhevsky et al., 2009) dataset and ResNet50 (He et al., 2016) on the CIFAR-100 dataset... (Page 4) ...Tiny-ImageNet (Le & Yang, 2015)... (Page 7) ...ILSVRC 2012 ImageNet Russakovsky et al. (2015). (Page 7) ...IWSLT14 German-English translation task (Cettolo et al., 2014)... (Page 8) ...Walker2d-v4, HalfCheetah-v4, and Ant-v4, which are continuous control environments simulated by the standard and widely-used engine, MuJoCo (Todorov et al., 2012). (Page 8) |
| Dataset Splits | No | The paper does not explicitly provide specific dataset split percentages, sample counts, or citations to predefined splits. It mentions using standard datasets like CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet, which commonly have predefined splits, but it does not state what those splits are in the text. |
| Hardware Specification | Yes | All experiments are conducted on RTX 4090 or A100 GPUs. (Page 18) ...We train all models for 100 epochs using a single NVIDIA RTX 4090 GPU. (Page 18) |
| Software Dependencies | No | The paper mentions using "PyTorch tutorial code" (Page 18), the "FairSeq framework" (Page 18), and the "Tianshou codebase (Weng et al., 2022)" (Page 8). However, it does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We choose the Cosine Annealing LR (Loshchilov & Hutter, 2016) as our training scheduler. Additionally, we set the learning rate as 1e-1 for all experiments, while the weight decay is set as 5e-4 for experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and 1e-1 for ImageNet. All models we used simply follow their paper's original architecture, and adopt the weight initialization introduced by He et al. (2015). Additionally, we train 300 epochs for experiments on CIFAR-10 and CIFAR-100 and train 100 epochs for Tiny-ImageNet and ImageNet. We use a 128 batch size for experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and 256 for ImageNet. (Page 18) ...We set the maximum batch size to 4,096 tokens and apply gradient clipping with a threshold of 0.1. The baseline learning rate is set to 0.25, and for the optimizer, we use a weight decay of 0.0001. (Page 18) ...we searched for suitable learning rates across the three games, ultimately setting 10e-2, 10e-2 and 10e-3 for Walker2d-v4, HalfCheetah-v4, and Ant-v4, respectively. (Page 18) |
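The FSGDM pseudocode in the table above can be sketched as a single scalar update step. The following is a minimal, illustrative Python implementation, not the authors' released code: the hyperparameter values (`c`, `v`, `n_stages`) are assumptions for demonstration, and the staircase coefficient u_t = u(⌈t/δ⌉·δ) with u(t) = t/(t + µ) follows the reconstructed Algorithm 1.

```python
import math

def fsgdm_update(x, m, grad, t, lr, total_steps, c=15.0, v=1.0, n_stages=100):
    """One FSGDM step (sketch). c, v, and n_stages are illustrative
    placeholders, not the paper's tuned defaults."""
    mu = c * total_steps             # µ = c·Σ controls how fast u_t grows
    delta = total_steps / n_stages   # δ = Σ/N: length of each coefficient stage
    # Piecewise-constant (staircase) momentum coefficient:
    # u_t = u(⌈t/δ⌉·δ), with u(t) = t/(t + µ), so u_t rises toward 1 over training
    t_stage = math.ceil(t / delta) * delta
    u_t = t_stage / (t_stage + mu)
    m = u_t * m + v * grad           # m_t = u_t·m_{t−1} + v·g_t
    x = x - lr * m                   # x_t = x_{t−1} − α_t·m_t
    return x, m
```

Because u_t is small early on, early steps behave almost like plain SGD (preserving the raw gradient), while the growing u_t progressively amplifies low-frequency gradient components later in training, matching the paper's stated design goal. For example, minimizing f(x) = x² from x = 5 drives x toward 0 within a hundred steps at lr = 0.05.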