Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
Authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets, where alterations do not affect models' generalization performance. However, models trained with adaptive methods show sensitivity to these changes, suggesting that their use of irrelevant frequencies can lead to solutions sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, the models optimized with GD and signGD have standard risks close to zero but vary in their adversarial risks. Our result shows that a linear model's robustness to ℓ2-norm bounded changes is inversely proportional to its weight norm: a smaller weight norm implies better robustness. In the context of deep learning, our experiments show that SGD-trained neural networks have smaller Lipschitz constants, explaining their better robustness to input perturbations compared to networks trained with adaptive gradient methods. |
| Researcher Affiliation | Academia | Avery Ma (University of Toronto, Vector Institute); Yangchen Pan (University of Oxford); Amir-massoud Farahmand (University of Toronto, Vector Institute) |
| Pseudocode | No | The paper describes algorithms and methods in prose but does not present any explicit pseudocode blocks or algorithm listings with structured formatting. |
| Open Source Code | Yes | Our source code is available at https://github.com/averyma/opt-robust. |
| Open Datasets | Yes | As a first step, we compare how models, trained with SGD, Adam, and RMSProp, differ in their standard generalization and robustness on seven benchmark datasets (LeCun, 1998; Xiao et al., 2017; Krizhevsky & Hinton, 2009; Netzer et al., 2011; Howard; Fei-Fei et al., 2004). Additionally, we extend our analysis to include results from experiments on Vision Transformers (Dosovitskiy et al., 2021) and an audio dataset (Warden, 2018). |
| Dataset Splits | Yes | In our experiments, we evaluate standard generalization using the accuracy of the trained classifier on the original test dataset. To measure robustness, we consider the classification accuracy on the test dataset perturbed by Gaussian noise, and by ℓ2 and ℓ∞ bounded adversarial perturbations (Croce & Hein, 2020). We follow the default PyTorch configuration to train all the models and sweep through a wide range of learning rates. The final model is selected with the highest validation accuracy. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. It only mentions general funding sources in the acknowledgments: "Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute." |
| Software Dependencies | Yes | For all models, we use the following default PyTorch (v1.12.1) optimization settings. |
| Experiment Setup | Yes | We follow the default PyTorch configuration to train all the models and sweep through a wide range of learning rates. The final model is selected with the highest validation accuracy. For SGD, we disable all of the following mechanisms: dampening, weight decay, and Nesterov. For Adam, we use the default values of β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, and disable weight decay and AMSGrad (Reddi et al., 2018). For RMSProp, we use the default values of α = 0.99, ϵ = 10⁻⁸, and disable momentum and centered RMSProp, which normalizes the gradient by an estimate of its variance. All models are trained for 200 epochs. In Table 2, we list the initial learning rate. The learning rate decreases by a factor of 0.1 at epochs 100 and 150. |
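The claim quoted above, that a linear model's robustness to ℓ2-bounded perturbations is inversely proportional to its weight norm, follows from the margin formula for a linear classifier sign(w·x + b): the smallest ℓ2 perturbation that flips the prediction on x has norm |w·x + b| / ‖w‖₂. The sketch below illustrates this with hypothetical numbers (not taken from the paper), mirroring the paper's "irrelevant frequency" setup with irrelevant input coordinates:

```python
import numpy as np

def l2_margin(w, x, b=0.0):
    """Smallest l2-norm perturbation that flips sign(w @ x + b) for a linear model."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Clean input: only the first coordinate carries signal;
# coordinates 2 and 3 are "irrelevant" (zero in natural data).
x = np.array([1.0, 0.0, 0.0])

w_sparse = np.array([2.0, 0.0, 0.0])   # ignores the irrelevant coordinates
w_dense  = np.array([2.0, 3.0, -3.0])  # also weights the irrelevant coordinates

# Both models make the same clean prediction on x ...
assert np.sign(w_sparse @ x) == np.sign(w_dense @ x)

# ... but the denser model has a larger weight norm, hence a smaller margin.
print(l2_margin(w_sparse, x))  # 1.0
print(l2_margin(w_dense, x))   # 2 / sqrt(22) ≈ 0.426
```

This matches the paper's high-level finding: a model that places weight on irrelevant input directions can match another model's standard accuracy while being strictly less robust.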
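The optimizer settings quoted in the Experiment Setup row can be expressed as a PyTorch sketch. The model and the learning rate below are placeholders (the paper sweeps learning rates and lists the initial values in its Table 2); momentum for SGD is not specified in the excerpt, so PyTorch's default of 0 is kept:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model, not the paper's architecture
lr = 0.1  # stand-in value; the paper sweeps a wide range of learning rates

# SGD with dampening, weight decay, and Nesterov disabled (per the excerpt).
sgd = torch.optim.SGD(model.parameters(), lr=lr,
                      dampening=0.0, weight_decay=0.0, nesterov=False)

# Adam with default betas and eps; weight decay and AMSGrad disabled.
adam = torch.optim.Adam(model.parameters(), lr=lr,
                        betas=(0.9, 0.999), eps=1e-8,
                        weight_decay=0.0, amsgrad=False)

# RMSProp with default alpha and eps; momentum and centering disabled.
rmsprop = torch.optim.RMSprop(model.parameters(), lr=lr,
                              alpha=0.99, eps=1e-8,
                              momentum=0.0, centered=False)

# 200-epoch schedule: decay the learning rate by 0.1 at epochs 100 and 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(sgd, milestones=[100, 150], gamma=0.1)
```

The paper's released code at https://github.com/averyma/opt-robust is the authoritative reference; this fragment only restates the hyperparameters quoted in the table.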