Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Understanding AdamW through Proximal Methods and Scale-Freeness

Authors: Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-ℓ2. Next, we consider the property of scale-freeness enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-ℓ2 and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates. Section 4 is titled "Deep Learning Empirical Evaluation".
Researcher Affiliation | Academia | Zhenxun Zhuang (Boston University); Mingrui Liu (George Mason University); Ashok Cutkosky (Boston University); Francesco Orabona (Boston University)
Pseudocode | Yes | Algorithm 1: Adam with ℓ2 regularization (Adam-ℓ2) and AdamW (Loshchilov & Hutter, 2017); Algorithm 2: AdaGrad (Duchi et al., 2010a; McMahan & Streeter, 2010); Algorithm 3: AdaGrad with Restart
Open Source Code | No | The text discusses the source code of third-party tools or platforms that the authors used, but does not provide the authors' own implementation code. The paper mentions: "1https://github.com/akamaster/pytorch_resnet_cifar10", "2https://github.com/bearpaw/pytorch-classification". These are external model implementations, not the authors' code for their specific methods.
Open Datasets | Yes | We consider the image classification task on CIFAR-10/100 datasets.
Dataset Splits | No | The paper mentions using CIFAR-10/100 datasets and describes data augmentation techniques, but it does not explicitly provide training/test/validation split percentages, sample counts, or a citation to specific predefined splits within the text.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions "Tensorflow and Pytorch" when comparing algorithms but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | For both Adam-ℓ2 and AdamW, we set β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸ as suggested in the original Adam paper Kingma & Ba (2015). To set the initial step size α and weight decay parameter λ, we grid search over {0.00005, 0.0001, 0.0005, 0.001, 0.005} for α and {0, 0.00001, 0.00005, 0.0001, 0.0005, 0.001} for λ. Whenever the best performing hyperparameters lie in the boundary of the searching grid, we always extend the grid to ensure that the final best-performing hyperparameters fall into the interior of the grid. [...] We use a mini-batch of 128, and train 300 epochs unless otherwise specified.
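The distinction the paper draws between Adam-ℓ2 and AdamW can be illustrated with a minimal NumPy sketch of a single optimizer step. This is not the authors' implementation: the function names and default hyperparameters here are illustrative, and the only structural claim taken from the source is where the weight-decay term λw enters the update (folded into the gradient for Adam-ℓ2, decoupled from the adaptive scaling for AdamW, which with ϵ → 0 makes the AdamW update invariant to component-wise gradient rescaling).

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=1e-4):
    """Adam with l2 regularization: the decay term lam*w is folded into
    the gradient, so it passes through the adaptive denominator."""
    g = grad + lam * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-4):
    """AdamW: weight decay is decoupled from the adaptive scaling, so the
    m_hat / sqrt(v_hat) ratio is scale-free (exactly so when eps = 0)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - alpha * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)
    return w, m, v
```

With ϵ = 0, rescaling each gradient component by a positive constant leaves the AdamW step unchanged, since the scaling cancels in m_hat / sqrt(v_hat); in Adam-ℓ2 the λw term rides inside that ratio, which breaks the cancellation in general.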
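The grid search described in the experiment setup row can be sketched as follows. The grid values for α and λ are taken from the source; `train_and_eval` is a hypothetical stand-in for one full training run returning a validation score, and the boundary-extension step the authors describe is noted only as a comment.

```python
import itertools

# Grid values quoted from the paper's setup.
ALPHAS = [0.00005, 0.0001, 0.0005, 0.001, 0.005]
LAMBDAS = [0, 0.00001, 0.00005, 0.0001, 0.0005, 0.001]

def grid_search(train_and_eval, alphas=ALPHAS, lambdas=LAMBDAS):
    """Exhaustive search over the (alpha, lambda) grid.

    train_and_eval(alpha=..., lam=...) is assumed to run one training
    configuration and return a score to maximize. Per the paper, if the
    best configuration lands on the grid boundary, the grid is extended
    until the optimum is interior (not implemented in this sketch).
    """
    best = None
    for alpha, lam in itertools.product(alphas, lambdas):
        score = train_and_eval(alpha=alpha, lam=lam)
        if best is None or score > best[0]:
            best = (score, alpha, lam)
    return best
```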