A Unified Approach to Controlling Implicit Regularization via Mirror Descent

Authors: Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments, we demonstrate that MD is a versatile method to produce learned models with different regularizers, which in turn have different generalization performances. In Section 4, we investigate the implications of our theoretical findings by applying a subclass of MD that is both efficient and scalable. Our experiments involving linear models corroborate our theoretical results in Section 3, and real-world experiments with deep neural networks and popular datasets suggest that our findings carry over to such nonlinear settings.
Researcher Affiliation | Academia | Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan; Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Pseudocode | Yes | Listing 1: Sample PyTorch implementation of p-GD
Open Source Code | No | To illustrate that p-GD can be easily implemented, we show a proof-of-concept implementation in PyTorch. This implementation can directly replace existing optimizers and thus requires only minor changes to any existing training code. (Appendix H)
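The report only quotes the paper's description of its PyTorch proof of concept; to make the p-GD update concrete, here is a minimal NumPy sketch of a single step, assuming the standard p-norm potential ψ(w) = (1/p)‖w‖_p^p for mirror descent. The function name `pgd_step` and all parameter defaults are our own illustration, not the paper's Listing 1:

```python
import numpy as np

def pgd_step(w, grad, lr=1e-3, p=3.0):
    """One step of p-GD: mirror descent with potential (1/p) * ||w||_p^p.

    The mirror map is grad-psi(w)_i = sign(w_i) * |w_i|^(p-1); the update
    takes a gradient step in the dual space, then maps back to the primal.
    Assumes p > 1 so the inverse map is well defined.
    """
    # Map parameters into the dual (mirror) space.
    z = np.sign(w) * np.abs(w) ** (p - 1)
    # Plain gradient step on the dual variables.
    z = z - lr * grad
    # Map back to the primal space via the inverse mirror map.
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1))

# Toy usage: minimize 0.5 * ||w - target||^2 from a zero initialization.
w = np.zeros(2)
target = np.array([1.0, -2.0])
for _ in range(5000):
    w = pgd_step(w, w - target, lr=0.01, p=3.0)
```

With p = 2 this reduces to ordinary gradient descent, which is why such an optimizer can drop into existing training code with only minor changes.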
Open Datasets | Yes | Image classification on MNIST. For a more involved example, we apply p-GD to the MNIST dataset (LeCun et al., 1998). Specifically, we perform a set of experiments on the CIFAR-10 dataset (Krizhevsky et al., 2009). ImageNet experiments. We also perform a similar set of experiments on the ImageNet dataset (Russakovsky et al., 2015).
Dataset Splits | Yes | Image classification on MNIST. For a more involved example, we apply p-GD to the MNIST dataset (LeCun et al., 1998). For this task, we use two different architectures: 1) a 2-layer fully connected network with 300 hidden neurons and ReLU activation, and 2) a convolutional network with two convolution layers and batch-norm. We train the fully connected network for 200 epochs and the convolution network for 50 epochs. The detailed specification of this experiment can be found in Appendix I. For the experiments with the CIFAR-10 dataset, we adopted the example implementation from the FFCV library. For the experiments with the ImageNet dataset, we used the example implementation from the FFCV library.
Hardware Specification | Yes | All of the following experiments were performed on compute nodes equipped with an Intel Skylake CPU + one Nvidia V100 GPU.
Software Dependencies | No | To illustrate that p-GD can be easily implemented, we show a proof-of-concept implementation in PyTorch. For the experiments with the CIFAR-10 dataset, we adopted the example implementation from the FFCV library. For the experiments with the ImageNet dataset, we used the example implementation from the FFCV library. (No specific version numbers for PyTorch or the FFCV library are provided.)
Experiment Setup | Yes | We ran p-GD with fixed step size 10^-3 for 1 million steps. We used a fixed step size of η = 10^-4 and ran one million iterations for different p's. As for the normalized mirror descent update (9), we use a base step size η0 = 10^-3 and scale λ = 10^-3. For the fully connected network, we train for 200 epochs in total and use a learning rate schedule that starts with η = 0.1 and decays by a factor of 5 at the 120th, 150th, and 180th epochs. For both models, we applied cross-entropy loss and batch size of 512. We used a cyclic learning rate schedule with a maximum learning rate of 0.1 and ran for 400 epochs. We used a cyclic learning rate schedule with a maximum learning rate of 0.5 and ran for 120 epochs.
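The fully connected network's schedule quoted above (start at η = 0.1, divide by 5 at epochs 120, 150, and 180) can be expressed as a short step-decay function. This is a sketch of that schedule only; the name `step_decay_lr` and its signature are our own, not from the paper's code:

```python
def step_decay_lr(epoch, base_lr=0.1, decay=5.0, milestones=(120, 150, 180)):
    """Step-decay schedule: divide base_lr by `decay` at each milestone epoch.

    Mirrors the quoted MNIST fully connected setup: eta starts at 0.1 and
    drops by a factor of 5 at the 120th, 150th, and 180th epochs.
    """
    # Count how many milestones have already been passed.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (decay ** drops)
```

In PyTorch the same effect is typically obtained with the built-in `torch.optim.lr_scheduler.MultiStepLR` (milestones [120, 150, 180], gamma 0.2).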