Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
Authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets, where alterations do not affect models' generalization performance. However, models trained with adaptive methods show sensitivity to these changes, suggesting that their use of irrelevant frequencies can lead to solutions sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, the models optimized with GD and signGD have standard risks close to zero but vary in their adversarial risks. Our result shows that a linear model's robustness to ℓ2-norm bounded changes is inversely proportional to its weight norm: a smaller weight norm implies better robustness. In the context of deep learning, our experiments show that SGD-trained neural networks have smaller Lipschitz constants, explaining their better robustness to input perturbations compared to networks trained with adaptive gradient methods. |
| Researcher Affiliation | Academia | Avery Ma (University of Toronto, Vector Institute); Yangchen Pan (University of Oxford); Amir-massoud Farahmand (University of Toronto, Vector Institute) |
| Pseudocode | No | The paper describes algorithms and methods in prose but does not present any explicit pseudocode blocks or algorithm listings with structured formatting. |
| Open Source Code | Yes | Our source code is available at https://github.com/averyma/opt-robust. |
| Open Datasets | Yes | As a first step, we compare how models, trained with SGD, Adam, and RMSProp, differ in their standard generalization and robustness on seven benchmark datasets (LeCun, 1998; Xiao et al., 2017; Krizhevsky & Hinton, 2009; Netzer et al., 2011; Howard; Fei-Fei et al., 2004). Additionally, we extend our analysis to include results from experiments on Vision Transformers (Dosovitskiy et al., 2021) and an audio dataset (Warden, 2018). |
| Dataset Splits | Yes | In our experiments, we evaluate standard generalization using the accuracy of the trained classifier on the original test dataset. To measure robustness, we consider the classification accuracy on the test dataset perturbed by Gaussian noise, and by ℓ2 and ℓ∞ bounded adversarial perturbations (Croce & Hein, 2020). We follow the default PyTorch configuration to train all the models and sweep through a wide range of learning rates. The final model is selected with the highest validation accuracy. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. It only mentions general funding sources in the acknowledgments: "Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute." |
| Software Dependencies | Yes | For all models, we use the following default PyTorch (v1.12.1) optimization settings. |
| Experiment Setup | Yes | We follow the default PyTorch configuration to train all the models and sweep through a wide range of learning rates. The final model is selected with the highest validation accuracy. For SGD, we disable all of the following mechanisms: dampening, weight decay, and Nesterov. For Adam, we use the default values of β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and disable weight decay and AMSGrad (Reddi et al., 2018). For RMSProp, we use the default values of α = 0.99, ϵ = 10⁻⁸, and disable momentum and centered RMSProp, which normalizes the gradient by an estimate of its variance. All models are trained for 200 epochs. In Table 2, we list the initial learning rate. The learning rate decreases by a factor of 0.1 at epochs 100 and 150. |
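The claim quoted above, that a linear model's robustness to ℓ2-bounded perturbations is inversely proportional to its weight norm, follows from the margin formula for a linear classifier sign(w·x + b): the smallest ℓ2 perturbation that flips the prediction on x has norm |w·x + b| / ‖w‖₂. The sketch below illustrates this with hypothetical numbers (not taken from the paper), mirroring the paper's "irrelevant frequency" setup with irrelevant input coordinates:

```python
import numpy as np

def l2_margin(w, x, b=0.0):
    """Smallest l2-norm perturbation that flips sign(w @ x + b) for a linear model."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Clean input: only the first coordinate carries signal;
# coordinates 2 and 3 are "irrelevant" (zero in natural data).
x = np.array([1.0, 0.0, 0.0])

w_sparse = np.array([2.0, 0.0, 0.0])   # ignores the irrelevant coordinates
w_dense  = np.array([2.0, 3.0, -3.0])  # also weights the irrelevant coordinates

# Both models make the same clean prediction on x ...
assert np.sign(w_sparse @ x) == np.sign(w_dense @ x)

# ... but the denser model has a larger weight norm, hence a smaller margin.
print(l2_margin(w_sparse, x))  # 1.0
print(l2_margin(w_dense, x))   # 2 / sqrt(22) ≈ 0.426
```

This matches the paper's high-level finding: a model that places weight on irrelevant input directions can match another model's standard accuracy while being strictly less robust.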
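The optimizer settings quoted in the Experiment Setup row can be expressed as a PyTorch sketch. The model and the learning rate below are placeholders (the paper sweeps learning rates and lists the initial values in its Table 2); momentum for SGD is not specified in the excerpt, so PyTorch's default of 0 is kept:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model, not the paper's architecture
lr = 0.1  # stand-in value; the paper sweeps a wide range of learning rates

# SGD with dampening, weight decay, and Nesterov disabled (per the excerpt).
sgd = torch.optim.SGD(model.parameters(), lr=lr,
                      dampening=0.0, weight_decay=0.0, nesterov=False)

# Adam with default betas and eps; weight decay and AMSGrad disabled.
adam = torch.optim.Adam(model.parameters(), lr=lr,
                        betas=(0.9, 0.999), eps=1e-8,
                        weight_decay=0.0, amsgrad=False)

# RMSProp with default alpha and eps; momentum and centering disabled.
rmsprop = torch.optim.RMSprop(model.parameters(), lr=lr,
                              alpha=0.99, eps=1e-8,
                              momentum=0.0, centered=False)

# 200-epoch schedule: decay the learning rate by 0.1 at epochs 100 and 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(sgd, milestones=[100, 150], gamma=0.1)
```

The paper's released code at https://github.com/averyma/opt-robust is the authoritative reference; this fragment only restates the hyperparameters quoted in the table.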