Learning to Optimize Quasi-Newton Methods
Authors: Isaac Liao, Rumen Dangovski, Jakob Nicolaus Foerster, Marin Soljacic
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally verify that our algorithm can optimize in noisy settings, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters at speeds comparable to those of standard neural network optimizers. We present here a number of tasks which provide experimental evidence for the theoretical results we claim, though we test a variety of optimizers on these tasks. Section 5 describes experiments on: 5.1 Noisy Quadratic Bowl, 5.2 Rosenbrock Function, 5.3 Image Generation, 5.4 Image Classification. |
| Researcher Affiliation | Academia | Isaac C. Liao EMAIL Research Lab for Electronics, MIT; Rumen R. Dangovski EMAIL Research Lab for Electronics, MIT; Jakob N. Foerster EMAIL Department of Engineering Science, University of Oxford; Marin Soljačić EMAIL Research Lab for Electronics, MIT |
| Pseudocode | Yes | Algorithm 1: Learning to Optimize During Optimization (LODO). Require: f : ℝⁿ → ℝ (function to minimize); x₀ ∈ ℝⁿ (initialization); α ∈ ℝ (meta-learning rate, default 0.001); α₀ ∈ ℝ (initial learning rate, default 1.0); 0 ≤ β < 1 (momentum, default 0.9). t ← 0 (start time); θ₀ ← random initialization (initialization for G neural network); m₀ ← 0 (initialize momentum). While not converged: x_{t+1} ← x_t − G(θ_t) m_t (pick a step using G with Eqs. (1) and (2)); ℓ_{t+1} ← f(x_{t+1}) (compute loss after step); θ_{t+1} ← θ_t + Adam(∇_{θ_t} ℓ_{t+1}) (tune the G model to pick better steps); m_{t+1} ← β m_t + (1 − β) ∇_{x_{t+1}} ℓ_{t+1} (update momentum); t ← t + 1 (increment time). End while; return θ_t |
| Open Source Code | No | The paper does not provide a direct link to source code or explicitly state that the code will be made available in supplementary materials or a repository. |
| Open Datasets | Yes | We use our optimizer to train a semi-realistic deep neural network with 95k parameters in an autoregressive image generation task similar to training a Pixel CNN (Oord et al., 2016) to generate MNIST images (Lecun et al., 1998). We conduct an experiment on image classification with Resnet18 (He et al., 2016) on CIFAR10 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | Right: Validation loss by step, using a subset of 64 images excluded from the training data. We use the standard Resnet setup on CIFAR10 by replacing the 7x7 convolution with a 3x3 one, and removing the maxpool in the first convolutional block. We use the standard data augmentation and a batch size of 2048. |
| Hardware Specification | Yes | We performed all optimization runs in TensorFlow 2, each with 40 Intel Xeon Gold 6248 CPUs and 2 Nvidia Volta V100 GPUs. |
| Software Dependencies | Yes | We performed all optimization runs in TensorFlow 2 |
| Experiment Setup | Yes | In every experiment, we tuned the hyperparameters of each optimizer using a genetic algorithm of 10 generations and 32 individuals per generation (16 individuals per generation for the Resnet CIFAR10 task). The tuned hyperparameters can be found in Table 6. For Image Classification, we used a batch size of 2048. For Image Generation, parameters were initialized with LeCun normal initialization. |
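The extracted Algorithm 1 can be illustrated with a minimal NumPy sketch. To stay self-contained, it makes assumptions not taken from the paper: the hypernetwork G(θ) is reduced to a simple diagonal matrix diag(θ), the meta-update uses plain SGD in place of Adam, and the meta-gradient ∇_θ ℓ is computed analytically via the chain rule for this toy G. Names like `meta_lr` and `lodo` are illustrative only.

```python
import numpy as np

def lodo(f, grad_f, x0, steps=2000, meta_lr=1e-3, alpha0=0.1, beta=0.9):
    """Toy LODO loop: G(theta) = diag(theta), SGD stands in for Adam."""
    x = x0.astype(float).copy()
    theta = np.full(x.size, alpha0)   # G(theta_0) ~ alpha0 * I
    m = np.zeros(x.size)              # momentum buffer m_0 = 0
    for _ in range(steps):
        x_new = x - theta * m         # x_{t+1} = x_t - G(theta_t) m_t
        g_new = grad_f(x_new)         # gradient of loss after the step
        # Chain rule for this diagonal G: d loss / d theta = -m * grad_f(x_new)
        theta -= meta_lr * (-m * g_new)        # meta-update (SGD, not Adam)
        m = beta * m + (1.0 - beta) * g_new    # m_{t+1} = b m_t + (1-b) grad
        x = x_new
    return x

# Usage: minimize a mildly ill-conditioned quadratic bowl
c = np.array([1.0, 2.0, 4.0])
f = lambda x: 0.5 * np.sum(c * x * x)
grad_f = lambda x: c * x
x_min = lodo(f, grad_f, np.ones(3))
```

Note the sign of the meta-update: when the momentum m and the post-step gradient are aligned, −m ⊙ ∇f is negative, so θ grows and the per-coordinate step size accelerates, which is the adaptive behavior the algorithm's meta-learning is meant to capture.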