A Unified Approach to Controlling Implicit Regularization via Mirror Descent
Authors: Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experiments, we demonstrate that MD is a versatile method to produce learned models with different regularizers, which in turn have different generalization performances. In Section 4, we investigate the implications of our theoretical findings by applying a subclass of MD that is both efficient and scalable. Our experiments involving linear models corroborate our theoretical results in Section 3, and real-world experiments with deep neural networks and popular datasets suggest that our findings carry over to such nonlinear settings. |
| Researcher Affiliation | Academia | Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan — Massachusetts Institute of Technology, Cambridge, MA 02139, USA |
| Pseudocode | Yes | Listing 1: Sample PyTorch implementation of p-GD |
| Open Source Code | No | To illustrate that p-GD can be easily implemented, we show a proof-of-concept implementation in PyTorch. This implementation can directly replace existing optimizers and thus requires only minor changes to any existing training code. (Appendix H) |
| Open Datasets | Yes | Image classification on MNIST. For a more involved example, we apply p-GD to the MNIST dataset (LeCun et al., 1998). Specifically, we perform a set of experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009). ImageNet experiments. We also perform a similar set of experiments on the ImageNet dataset (Russakovsky et al., 2015). |
| Dataset Splits | Yes | Image classification on MNIST. For a more involved example, we apply p-GD to the MNIST dataset (LeCun et al., 1998). For this task, we use two different architectures: 1) a 2-layer fully connected network with 300 hidden neurons and ReLU activation, and 2) a convolutional network with two convolution layers and batch-norm. We train the fully connected network for 200 epochs and the convolution network for 50 epochs. The detailed specification of this experiment can be found in Appendix I. For the experiments with the CIFAR-10 dataset, we adopted the example implementation from the FFCV library. For the experiments with the ImageNet dataset, we used the example implementation from the FFCV library. |
| Hardware Specification | Yes | All of the following experiments were performed on compute nodes equipped with an Intel Skylake CPU + one Nvidia V100 GPU. |
| Software Dependencies | No | To illustrate that p-GD can be easily implemented, we show a proof-of-concept implementation in PyTorch. For the experiments with the CIFAR-10 dataset, we adopted the example implementation from the FFCV library. For the experiments with the ImageNet dataset, we used the example implementation from the FFCV library. (No specific version numbers for PyTorch or FFCV library are provided.) |
| Experiment Setup | Yes | We ran p-GD with fixed step size 10^-3 for 1 million steps. We used a fixed step size of η = 10^-4 and ran one million iterations for different values of p. As for the normalized mirror descent update (9), we use a base step size η0 = 10^-3 and scale λ = 10^-3. For the fully connected network, we train for 200 epochs in total and use a learning rate schedule that starts with η = 0.1 and decays by a factor of 5 at the 120th, 150th, and 180th epochs. For both models, we applied cross-entropy loss and a batch size of 512. We used a cyclic learning rate schedule with a maximum learning rate of 0.1 and ran for 400 epochs. We used a cyclic learning rate schedule with a maximum learning rate of 0.5 and ran for 120 epochs. |
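The table above refers to the paper's Listing 1, a PyTorch implementation of p-GD (mirror descent with the potential ψ(w) = (1/p)‖w‖_p^p). The paper's own code is not reproduced here; as an illustration of the underlying update rule only, below is a minimal NumPy sketch of a single p-GD step. The function name `pgd_step` and its default arguments are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pgd_step(w, grad, lr=1e-3, p=3.0):
    """One p-GD step: mirror descent with potential psi(w) = (1/p)*||w||_p^p.

    The mirror map is grad psi(w) = sign(w)*|w|^(p-1); its inverse is
    z -> sign(z)*|z|^(1/(p-1)). Requires p > 1.
    """
    # Map parameters to the dual (mirror) space.
    z = np.sign(w) * np.abs(w) ** (p - 1)
    # Take a plain gradient step in the dual space.
    z = z - lr * grad
    # Map back to the primal space via the inverse mirror map.
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))
```

With p = 2 the mirror map is the identity, so the step reduces to ordinary gradient descent; other values of p induce the different implicit regularizers the experiments compare.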