A Momentumized, Adaptive, Dual Averaged Gradient Method
Authors: Aaron Defazio, Samy Jelassi
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results across a large number of problems across both categories to validate the general purpose utility of the MADGRAD approach. In our experiments we use the most common step-size reduction scheme used in the literature for each respective problem. For all algorithms, we performed a learning rate and decay sweep on a grid of {1×10^i, 2.5×10^i, 5×10^i} for a range of i large enough to ensure the best parameters for each problem and method were considered. We present the results from the best learning rate and decay for each method when considering test set performance. |
| Researcher Affiliation | Collaboration | Aaron Defazio, Facebook AI Research, New York; Samy Jelassi, Princeton University, Princeton |
| Pseudocode | Yes | Algorithm 1 MADGRAD. Require: step-size sequence γ_k, momentum sequence c_k, initial point x_0, epsilon ϵ. 1: s_0 = 0, ν_0 = 0 2: for k = 0, …, T do 3: sample ξ_k and set g_k = ∇f(x_k, ξ_k) 4: λ_k = γ_k √(k+1) 5: s_{k+1} = s_k + λ_k g_k 6: ν_{k+1} = ν_k + λ_k (g_k ∘ g_k) 7: z_{k+1} = x_0 − s_{k+1} / (∛ν_{k+1} + ϵ) 8: x_{k+1} = (1 − c_{k+1}) x_k + c_{k+1} z_{k+1} 9: end for 10: return x_T |
| Open Source Code | Yes | An implementation is available at https://github.com/facebookresearch/madgrad |
| Open Datasets | Yes | CIFAR10 (Krizhevsky, 2009) is an established baseline within the deep learning community due to its manageable size and representative performance within the class of data-limited supervised image classification problems. The ImageNet problem (Krizhevsky et al., 2012) is a larger problem more representative of image classification problems encountered in industrial applications, where a large number of classes and higher-resolution input images are encountered. The fastMRI Knee challenge (Zbontar et al., 2018) is a recently proposed large-scale image-to-image problem. For a machine translation baseline we trained our model on the IWSLT14 German-to-English dataset (Cettolo et al., 2014), using a popular LSTM variant introduced by Wiseman and Rush (2016). We performed our experiments using the RoBERTa variant of BERT_BASE (Liu et al., 2019), a 110M parameter transformer model. |
| Dataset Splits | Yes | Following standard practice, we apply a data-augmentation step consisting of random horizontal flipping, 4px padding followed by random cropping to 32px at training time only. Our setup used data preprocessing consisting of a mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] normalization of the three respective color channels, followed by a RandomResizedCrop PyTorch operation to reduce the resolution to 224 pixels, followed by a random 50% chance of horizontal flipping. For test set evaluation, a resize to 256 pixels followed by a center crop to 224 pixels is used instead. |
| Hardware Specification | Yes | GPUs: 1x V100; GPUs: 8x V100 |
| Software Dependencies | No | The AdaGrad implementations available in major deep learning frameworks (PyTorch, TensorFlow) contain the mirror descent form only. Our implementation used fairseq defaults except for the parameters listed below. |
| Experiment Setup | Yes | Hyper-parameters: Architecture: PreActResNet152; Epochs: 300; GPUs: 1x V100; Batch Size per GPU: 128; LR schedule: 150-225 tenthing; Seeds: 10. Per-method (LR, Decay): MADGRAD (2.5e-4, 0.0001); AdaGrad (0.01, 0.0001); Adam (0.00025, 0.0001); SGD (0.1, 0.0001). |
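The pseudocode row above can be sketched as a single NumPy update step. This is our own illustrative reading of Algorithm 1, not the authors' released implementation (that lives at the linked facebookresearch/madgrad repository); the function and variable names here are ours, and λ_k = γ_k√(k+1) is taken from the paper's weighting scheme.

```python
import numpy as np

def madgrad_step(x0, s, nu, x, grad, gamma, k, c, eps=1e-6):
    """One MADGRAD update (sketch of Algorithm 1; names are ours).

    x0    : initial point, kept fixed across all steps (dual averaging anchors here)
    s     : running weighted sum of gradients
    nu    : running weighted sum of squared gradients
    x     : current iterate
    grad  : stochastic gradient evaluated at x
    gamma : base step size; c : momentum parameter in (0, 1]
    """
    lam = gamma * np.sqrt(k + 1)          # dual-averaging weight lambda_k
    s = s + lam * grad                    # s_{k+1} = s_k + lambda_k g_k
    nu = nu + lam * grad * grad           # nu_{k+1}: elementwise squared gradient
    z = x0 - s / (np.cbrt(nu) + eps)      # cube-root denominator (not the usual sqrt)
    x = (1 - c) * x + c * z               # momentum as interpolation toward z
    return s, nu, x
```

Note the two features that distinguish MADGRAD from Adam-style methods in this sketch: the iterate is always re-derived from the fixed anchor x_0 (dual averaging), and the adaptivity uses a cube root rather than a square root.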