Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Authors: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar

JMLR 2017

Reproducibility Assessment (variable, result, and supporting LLM response):
Research Type: Experimental
  "We compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems." "In this section, we evaluate the empirical behavior of Hyperband with three different resource types: iterations, data set subsamples, and feature samples. For all experiments, we compare Hyperband with three well-known Bayesian optimization algorithms: SMAC, TPE, and Spearmint, using their default settings."
Researcher Affiliation: Collaboration
  Lisha Li (Carnegie Mellon University, Pittsburgh, PA 15213)
  Kevin Jamieson (University of Washington, Seattle, WA 98195)
  Giulia DeSalvo (Google Research, New York, NY 10011)
  Afshin Rostamizadeh (Google Research, New York, NY 10011)
  Ameet Talwalkar (Carnegie Mellon University, Pittsburgh, PA 15213; Determined AI)
Pseudocode: Yes
  Algorithm 1: Hyperband algorithm for hyperparameter optimization.
  Figure 9 (bottom): The Hyperband algorithm for the infinite horizon setting; Hyperband calls Successive Halving as a subroutine.
  Figure 10: The finite horizon Successive Halving and Hyperband algorithms are inspired by their infinite horizon counterparts of Figure 9 to handle practical constraints.
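For readers who want a concrete rendering of the pseudocode, the finite-horizon Hyperband loop can be sketched in Python as below. This is an illustrative interpretation, not the authors' implementation: `get_config` and `run_config` are stand-ins for the paper's get_hyperparameter_configuration and run_then_return_val_loss subroutines.

```python
import math
import random


def hyperband(get_config, run_config, R=81, eta=3):
    """Sketch of the finite-horizon Hyperband loop.

    get_config()       -> a fresh random hyperparameter configuration
    run_config(cfg, r) -> validation loss after training cfg with r resource units
    """
    s_max = int(math.floor(math.log(R, eta) + 1e-9))   # number of brackets minus one
    B = (s_max + 1) * R                                # total budget per bracket
    best_cfg, best_loss = None, float("inf")
    for s in reversed(range(s_max + 1)):               # one Successive Halving bracket per s
        n = int(math.ceil(B / R * eta**s / (s + 1)))   # initial number of configurations
        r = R / eta**s                                 # initial resource per configuration
        configs = [get_config() for _ in range(n)]
        for i in range(s + 1):                         # Successive Halving rounds
            r_i = r * eta**i
            losses = [run_config(c, r_i) for c in configs]
            ranked = sorted(zip(losses, configs), key=lambda lc: lc[0])
            if ranked[0][0] < best_loss:
                best_loss, best_cfg = ranked[0]
            keep = max(1, (n // eta**i) // eta)        # keep the top 1/eta fraction
            configs = [cfg for _, cfg in ranked[:keep]]
    return best_cfg, best_loss
```

Each value of s runs one Successive Halving bracket, trading off the number of configurations n against the resource allocated to each; `n // eta**i` corresponds to the paper's n_i = floor(n * eta^-i).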
Open Source Code: No
  "Code and description of algorithm used is available at http://deeplearning.net/tutorial/lenet.html." This URL refers to the LeNet model used in an example application, not to the Hyperband algorithm itself; the paper does not provide a link to, or an explicit statement about, an open-source Hyperband implementation.
Open Datasets: Yes
  "We work with the MNIST data set and optimize hyperparameters for the LeNet convolutional neural network..." "Data sets: We considered three image classification data sets: CIFAR-10 (Krizhevsky, 2009), rotated MNIST with background images (MRBI) (Larochelle et al., 2007), and Street View House Numbers (SVHN) (Netzer et al., 2011)." "We used the framework introduced by Feurer et al. (2015), which explored a structured hyperparameter search space comprised of 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods for a total of 110 hyperparameters."
Dataset Splits: Yes
  "Each data set was split into a training, validation, and test set: (1) CIFAR-10 has 40k, 10k, and 10k instances; (2) MRBI has 10k, 2k, and 50k instances; and (3) SVHN has close to 600k, 6k, and 26k instances for training, validation, and test respectively." "Feurer et al. (2015) split each data set into 2/3 training and 1/3 test set, whereas we introduce a validation set to avoid overfitting to the test data. We also used 2/3 of the data for training, but split the rest of the data into two equally sized validation and test sets."
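The split described in that second quote (2/3 train, with the remaining third divided equally between validation and test) is straightforward to reproduce. A minimal sketch, with an illustrative function name not taken from the paper:

```python
import random


def split_dataset(items, seed=0):
    """Shuffle, then split: 2/3 train and the remaining third divided
    equally into validation and test (name and seed are illustrative)."""
    rng = random.Random(seed)
    items = items[:]          # avoid mutating the caller's list
    rng.shuffle(items)
    n = len(items)
    n_train = (2 * n) // 3
    n_val = (n - n_train) // 2
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```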
Hardware Specification: Yes
  "The experiments took the equivalent of over 1 year of GPU hours on NVIDIA GRID K520 cards available on Amazon EC2 g2.8xlarge instances." "All experiments were performed on Google Cloud Compute n1-standard-1 instances in us-central1-f region with 1 CPU and 3.75GB of memory." "Each hyperparameter optimization algorithm was run for ten trials on Amazon EC2 m4.2xlarge instances." "We ran 10 trials of each searcher, with each trial lasting 12 hours on an n1-standard-16 machine from Google Cloud Compute."
Software Dependencies: No
  "The exact architecture used is the 18% model provided on cuda-convnet for CIFAR-10." "The width of the response normalization layer was excluded due to limitations of the Caffe framework." "The default SVM method in Scikit-learn is single core and takes hours to train on CIFAR-10." The paper mentions various software components and frameworks (cuda-convnet, Caffe, Scikit-learn) but does not provide specific version numbers for any of them.
Experiment Setup: Yes
  "Our search space includes learning rate, batch size, and number of kernels for the two layers of the network as hyperparameters (details are shown in Table 2 in Appendix A)." "We define the resource allocated to each configuration to be number of iterations of SGD, with one unit of resource corresponding to one epoch, i.e., a full pass over the data set." "We set R to 81 and use the default value of η = 3, resulting in s_max = 4 and thus 5 brackets of Successive Halving with different tradeoffs between n and B/n." "For CIFAR-10 and MRBI, R was set to 300 (or 30k total iterations). For SVHN, R was set to 600 (or 60k total iterations) to accommodate the larger training set. Given R for these experiments, we set η = 4 to yield five Successive Halving brackets for Hyperband."
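The bracket geometry quoted above follows mechanically from R and η. A small sketch (an illustration, not the paper's code) that enumerates the (n_i, r_i) schedule each Successive Halving bracket would run:

```python
import math


def hyperband_schedule(R, eta):
    """Enumerate Hyperband's Successive Halving brackets for a maximum
    resource R and downsampling rate eta (Algorithm 1 notation).

    Returns a list of (s, rounds) pairs, where rounds is a list of
    (n_i, r_i): configurations kept and resource per configuration."""
    s_max = int(math.floor(math.log(R, eta) + 1e-9))
    B = (s_max + 1) * R
    schedule = []
    for s in reversed(range(s_max + 1)):
        n = int(math.ceil(B / R * eta**s / (s + 1)))   # initial configurations
        r = R / eta**s                                 # initial resource each
        rounds = [(n // eta**i, r * eta**i) for i in range(s + 1)]
        schedule.append((s, rounds))
    return schedule
```

For R = 81 and η = 3 this yields the five brackets s = 4, ..., 0: the most aggressive bracket starts 81 configurations at 1 epoch each, while the most conservative runs 5 configurations for the full 81 epochs.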