AdaR: An Adaptive Gradient Method with Cyclical Restarting of Moment Estimations

Authors: Yangchuan Wang, Lianhong Ding, Peng Shi

IJCAI 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — Empirically, AdaR outperforms state-of-the-art optimization algorithms on image classification and language modeling tasks, demonstrating superior generalization and faster convergence. We conduct experiments on image classification and language modeling tasks.
Researcher Affiliation — Academia — Yangchuan Wang¹, Lianhong Ding¹ and Peng Shi², ¹Beijing Wuzi University, ²University of Science and Technology Beijing. EMAIL, EMAIL, EMAIL
Pseudocode — Yes — To address the long-tail gradient issue, this section provides the details of AdaR, with its pseudocode presented in Algorithm 1. The method divides each training epoch into fixed-iteration intervals, where it restarts moment estimations and accumulates the most recent gradients. ... Algorithm 1: ADAptive gradient methods via Restarting moment estimations (AdaR)
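The restart mechanism described above can be sketched as an Adam-style single-parameter update whose moment estimates are zeroed at fixed-iteration boundaries. This is a minimal illustrative sketch, not the authors' implementation: the function name, signature, and `restart_interval` default are assumptions; only the restart-then-accumulate structure comes from the excerpt above.

```python
# Sketch (assumed, not the paper's code) of AdaR's cyclical restart:
# an Adam-like step whose first and second moment estimates are reset
# every `restart_interval` iterations, so they accumulate only the most
# recent gradients rather than a long tail of stale ones.

def adar_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, restart_interval=100):
    """One scalar-parameter update; `state` holds m, v, and step counters."""
    state["t"] += 1
    # Restart at the start of each fixed-iteration interval.
    if (state["t"] - 1) % restart_interval == 0:
        state["m"] = 0.0   # first moment estimate
        state["v"] = 0.0   # second moment estimate
        state["k"] = 0     # steps since the last restart
    state["k"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction counts from the restart, not the global step.
    m_hat = state["m"] / (1 - beta1 ** state["k"])
    v_hat = state["v"] / (1 - beta2 ** state["k"])
    return theta - lr * m_hat / (v_hat ** 0.5 + eps)
```

In a real optimizer the same logic would run per tensor inside a PyTorch `Optimizer.step()`; the scalar form above only isolates the restart bookkeeping.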
Open Source Code — Yes — Code at https://github.com/tHappo/AdaR
Open Datasets — Yes — For convex optimization, we train a logistic regression model [LaValley, 2008] on the MNIST dataset [LeCun et al., 1998]. For non-convex tasks, we train VGGNet-11 [Simonyan and Zisserman, 2015], ResNet-34 [He et al., 2016], and DenseNet-121 [Huang et al., 2017] on the CIFAR-10 and CIFAR-100 datasets [Krizhevsky, 2009], as well as ResNet-18 [He et al., 2016] on the Tiny ImageNet dataset. In language modeling, we train a 1-layer LSTM [Merity et al., 2018] on the Penn Treebank (PTB) dataset [Marcus et al., 1993] and a Transformer [Vaswani et al., 2017] on the WikiText-2 dataset [Bojanowski et al., 2017].
Dataset Splits — Yes — MNIST consists of 60,000 training and 10,000 test samples of 28×28 grayscale images of handwritten digits. CIFAR-10 comprises 50,000 training and 10,000 test 32×32 color images across 10 classes. CIFAR-100 contains 50,000 training and 10,000 test 32×32 pixel images across 100 classes. Tiny ImageNet contains 200 classes, each with 500 training, 25 validation, and 25 test images, resized from 64×64 to 224×224 pixels. Penn Treebank (PTB) comprises 0.93 million training, 0.073 million validation, and 0.082 million testing tokens. WikiText-2 includes 1.9 million training, 0.17 million validation, and 0.19 million testing tokens.
Hardware Specification — Yes — All experiments are conducted on an NVIDIA RTX A4000 GPU (16 GB) and an AMD EPYC 7551P CPU, using Python 3.8 and the PyTorch library [Paszke, 2019].
Software Dependencies — No — All experiments are conducted on an NVIDIA RTX A4000 GPU (16 GB) and an AMD EPYC 7551P CPU, using Python 3.8 and the PyTorch library [Paszke, 2019]. The specific version number for the PyTorch library is not provided.
Experiment Setup — Yes — Each method is run 3 times, with the best result reported. All experiments are conducted on an NVIDIA RTX A4000 GPU (16 GB) and an AMD EPYC 7551P CPU, using Python 3.8 and the PyTorch library [Paszke, 2019]. ... The base step sizes for all algorithms and the final step size for AdaBound are selected from the range {1×10⁻⁵, 5×10⁻⁵, 1×10⁻⁴, …, 1}. The parameter β₁ is selected from {0.9, 0.99}, and β₂ is chosen from {0.99, 0.999}. SGD tunes momentum from the set {0.1, 0.2, …, 0.9}. For all image classification tasks, adaptive gradient methods use a step size of 0.001, β₁ = 0.9, and β₂ = 0.999, while SGD employs a step size of 0.1 and momentum of 0.9. For language modeling tasks, SGD utilizes step sizes of 30 for LSTM and 0.1 for Transformer. For LSTM, adaptive methods apply a step size of 0.01 with ε = 10⁻¹⁶, β₁ = 0.9, and β₂ = 0.999. For Transformer, a step size of 0.001, with β₁ = 0.9 and β₂ = 0.999, is adopted for all adaptive methods. ... A weight decay of 5×10⁻⁴ is applied to all algorithms. Each optimizer is executed for 100 epochs with a batch size of 128.
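The final settings quoted above can be gathered into a small configuration sketch for anyone attempting a reproduction. The dictionary layout and key names here are assumptions made for illustration; every value comes from the excerpt above (the tuning grids are omitted because their ellipses leave the intermediate values unspecified).

```python
# Final (selected) hyperparameters as reported in the excerpt above.
# Dict structure and key names are illustrative, not the authors' code.

image_classification = {
    "adaptive": {"lr": 1e-3, "beta1": 0.9, "beta2": 0.999},
    "sgd": {"lr": 0.1, "momentum": 0.9},
}

language_modeling = {
    "lstm": {
        "adaptive": {"lr": 1e-2, "eps": 1e-16, "beta1": 0.9, "beta2": 0.999},
        "sgd": {"lr": 30},
    },
    "transformer": {
        "adaptive": {"lr": 1e-3, "beta1": 0.9, "beta2": 0.999},
        "sgd": {"lr": 0.1},
    },
}

# Shared across all runs.
common = {"weight_decay": 5e-4, "epochs": 100, "batch_size": 128, "runs": 3}
```

These dicts could be fed to a training script's argument parser or passed directly to a PyTorch optimizer constructor; they are listed here only to make the reported setup easy to diff against a re-implementation.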