Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Understanding AdamW through Proximal Methods and Scale-Freeness
Authors: Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-ℓ2. Next, we consider the property of scale-freeness enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-ℓ2 and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates. Section 4 is titled "Deep Learning Empirical Evaluation". |
| Researcher Affiliation | Academia | Zhenxun Zhuang EMAIL Boston University; Mingrui Liu EMAIL George Mason University; Ashok Cutkosky EMAIL Boston University; Francesco Orabona EMAIL Boston University |
| Pseudocode | Yes | Algorithm 1 Adam with ℓ2 regularization (Adam-ℓ2) and AdamW (Loshchilov & Hutter, 2017); Algorithm 2 AdaGrad (Duchi et al., 2010a; McMahan & Streeter, 2010); Algorithm 3 AdaGrad with Restart |
| Open Source Code | No | The text discusses the source code of a third-party tool or platform that the authors used, but does not provide their own implementation code. The paper mentions: "1https://github.com/akamaster/pytorch_resnet_cifar10", "2https://github.com/bearpaw/pytorch-classification". These are external model implementations, not the authors' code for their specific methods. |
| Open Datasets | Yes | We consider the image classification task on CIFAR-10/100 datasets. |
| Dataset Splits | No | The paper mentions using CIFAR-10/100 datasets and describes data augmentation techniques, but it does not explicitly provide training/test/validation split percentages, sample counts, or a citation to specific predefined splits within the text. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions "Tensorflow and Pytorch" when comparing algorithms but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For both Adam-ℓ2 and AdamW, we set β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸ as suggested in the original Adam paper Kingma & Ba (2015). To set the initial step size α and weight decay parameter λ, we grid search over {0.00005, 0.0001, 0.0005, 0.001, 0.005} for α and {0, 0.00001, 0.00005, 0.0001, 0.0005, 0.001} for λ. Whenever the best performing hyperparameters lie in the boundary of the searching grid, we always extend the grid to ensure that the final best-performing hyperparameters fall into the interior of the grid. [...] We use a mini-batch of 128, and train 300 epochs unless otherwise specified. |
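The distinction the paper studies, Adam-ℓ2 versus AdamW, comes down to where the weight-decay term λ enters the update: folded into the gradient (so it passes through Adam's adaptive scaling) or applied directly to the weights. The single-scalar sketch below illustrates that difference and the scale-freeness property quoted above; it is not the authors' implementation, and the `adam_step` function and its signature are illustrative only.

```python
import math

def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, lam=0.0, decoupled=False):
    """One Adam update on a single scalar parameter w with gradient g.

    decoupled=False: Adam-l2 -- the regularizer lam*w is added to the
                     gradient, so it is rescaled by the adaptive step.
    decoupled=True:  AdamW -- the decay is applied directly to w and
                     bypasses the adaptive scaling (Loshchilov & Hutter).
    """
    if not decoupled:
        g = g + lam * w  # Adam-l2: only gradient info of the regularizer
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w = w - alpha * lam * w  # AdamW: decoupled weight decay
    return w, m, v
```

With λ = 0 and ϵ negligible, the update is `m_hat / sqrt(v_hat)`, so multiplying the gradient of a coordinate by any positive constant leaves that coordinate's update (essentially) unchanged: this is the scale-freeness the paper attributes to AdamW's advantage. With λ > 0, the two variants produce different iterates because Adam-ℓ2 passes the decay term through the `sqrt(v_hat)` denominator while AdamW does not.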