Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bayesian Quadrature for Neural Ensemble Search

Authors: Saad Hamid, Xingchen Wan, Martin Jørgensen, Binxin Ru, Michael A Osborne

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically in terms of test likelihood, accuracy, and expected calibration error that our method outperforms state-of-the-art baselines, and verify via ablation studies that its components do so independently. We undertake an empirical comparison of our proposals against state-of-the-art baselines. Additionally, we conduct ablation studies to understand the effect of our proposals for each stage of the NES pipeline.
Researcher Affiliation | Academia | Saad Hamid (EMAIL), University of Oxford; Xingchen Wan (EMAIL), University of Oxford; Martin Jørgensen (EMAIL), University of Oxford; Binxin Ru (EMAIL), University of Oxford; Michael Osborne (EMAIL), University of Oxford
Pseudocode | Yes | Algorithms 1, 2 and 3 summarise our propositions. Algorithm 1: candidate set selection using a BQ acquisition function. Algorithm 2: posterior recombination. Algorithm 3: re-weighted stacking.
Open Source Code | Yes | An implementation of our proposals can be found at https://github.com/saadhamidml/bq-nes.
Open Datasets | Yes | We begin by performing comparisons on the NATS-Bench benchmark (Dong et al., 2021). Specifically, we use the provided topology search space... The architecture weights are trained for 200 epochs on the CIFAR-100 and ImageNet16-120 (a smaller version of ImageNet with 16×16-pixel input images, and 120 classes) datasets. We then proceed to compare the two variants of our algorithm, BQ-R and BQ-S, with several baselines... Table 4 presents the results on CIFAR-100 and ImageNet16-120 for a range of ensemble sizes. Table 5: Test accuracy, expected calibration error (ECE), and log-likelihood (LL) on CIFAR-10 and CIFAR-100 for BQ-S (our proposal) and NES-RE (the strongest baseline) for the Slimmable Network search space.
Dataset Splits | Yes | The architecture weights are trained for 200 epochs on the CIFAR-100 and ImageNet16-120 (a smaller version of ImageNet with 16×16-pixel input images, and 120 classes) datasets. We will compare ensemble performance as measured by test accuracy, test likelihood, and expected calibration error on the test set for a range of ensemble sizes. The test set is selected by ranking all the architectures in the search space by validation loss, and selecting every 25th architecture. This ensures that the test set contains architectures across the full range of performance. Additionally, our approximation is much faster as it does not require training a supernet. We argue that our approximation is suitable as most posterior mass will be concentrated on these architectures, so a good variational distribution will concentrate mass on them as well. Instead, we examine the setting where only the test set is shifted, and the validation set is representative of the training set.
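The test-set construction quoted above (rank every architecture by validation loss, keep every 25th) can be sketched as follows. This is a hypothetical illustration, not code from the paper's repository; `architectures` and `val_loss` stand in for the benchmark's search space and its recorded validation losses:

```python
def select_test_architectures(architectures, val_loss, stride=25):
    """Rank all architectures by validation loss and keep every
    `stride`-th one, so the held-out set spans the full range of
    performance from best to worst."""
    ranked = sorted(architectures, key=val_loss)
    return ranked[::stride]
```

Because the slice starts at index 0, the best-performing architecture is always included, and the stride guarantees evenly spaced coverage of the ranking.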
Hardware Specification | No | The paper does not specify the exact hardware (GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The codebase uses PyTorch (Paszke et al., 2019) to handle deep learning and backpropagation.
Experiment Setup | Yes | The architecture weights are trained for 200 epochs on the CIFAR-100 and ImageNet16-120... We initialise with 10 architectures randomly selected from a uniform prior over the search space, and use the acquisition function to build a set of 150 architectures... Our proposal over the cell-based search space uses the WL kernel, with its level hyperparameter chosen from {1, 2} using the GP's marginal likelihood. For the macro-based search space, our proposal uses an ARD RBF kernel, whose hyperparameters are optimised using L-BFGS. The lengthscales are constrained between the minimum and maximum distances between observations along the relevant dimensions. Architecture likelihoods are normalised so that the maximum observed is 1 before modelling with the GP surrogate. The noise hyperparameter is also selected to optimise the probability density assigned to the observed data under the GP prior. It is constrained in the range [10^-5, 10^-1]. The acquisition function is always optimised using an evolutionary strategy, using a pool size of 1024. Per iteration, we allow 128 mutations, of which 16 are modifications of the architecture with the highest acquisition value, and the remainder are selected uniformly at random from the pool.
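The evolutionary strategy described in the setup quote (pool of 1024; 128 mutations per iteration, 16 perturbing the incumbent and the rest perturbing uniformly sampled pool members) might look roughly like the sketch below. `acquisition` and `mutate` are hypothetical placeholders for the paper's BQ acquisition function and its search-space-specific mutation operator; this is an illustrative reconstruction under those assumptions, not the authors' implementation:

```python
import random

def optimise_acquisition(pool, acquisition, mutate, n_iters,
                         pool_size=1024, n_mutations=128, n_best_mutations=16):
    """Evolutionary-strategy sketch: each iteration proposes `n_mutations`
    children, `n_best_mutations` of them mutating the architecture with the
    highest acquisition value and the remainder mutating architectures drawn
    uniformly at random from the pool; the pool then retains its
    `pool_size` highest-scoring members."""
    for _ in range(n_iters):
        incumbent = max(pool, key=acquisition)
        children = [mutate(incumbent) for _ in range(n_best_mutations)]
        children += [mutate(random.choice(pool))
                     for _ in range(n_mutations - n_best_mutations)]
        # Truncation selection: keep the best `pool_size` candidates.
        pool = sorted(pool + children, key=acquisition, reverse=True)[:pool_size]
    return max(pool, key=acquisition)
```

Because truncation selection always retains the top-scoring member, the incumbent's acquisition value is monotonically non-decreasing across iterations.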