Greedy Bayesian Posterior Approximation with Deep Ensembles

Authors: Aleksei Tiulpin, Matthew B. Blaschko

TMLR 2022

Reproducibility assessment: each variable is listed with its result and a supporting excerpt from the paper.
Research Type: Experimental. "The performance of our approach is demonstrated on computer vision out-of-distribution detection benchmarks in a range of architectures trained on multiple datasets."
Researcher Affiliation: Academia. "Aleksei Tiulpin (aleksei.tiulpin@oulu.fi), Research Unit of Medical Imaging, Physics and Technology, Faculty of Medicine, University of Oulu, Finland; Matthew B. Blaschko (EMAIL), Center for Processing Speech and Images, Department of Electrical Engineering, KU Leuven, Belgium."
Pseudocode: Yes. "Algorithm 1: Random Greedy algorithm. Algorithm 2: O(k) Random Greedy-based algorithm for training ensembles of neural networks."
Open Source Code: Yes. "The source code of our method is made publicly available at https://github.com/Oulu-IMEDS/greedy_ensembles_training."
Open Datasets: Yes. "We ran our main experiments on CIFAR10, CIFAR100 (Krizhevsky, 2009) and SVHN (Netzer et al., 2011) in-distribution datasets. Our OOD detection benchmark included CIFAR10, CIFAR100, DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), LSUN (Yu et al., 2015), Tiny ImageNet (Le & Yang, 2015), Places 365 (Zhou et al., 2017), and Bernoulli noise, Gaussian noise, random blob, and uniform noise images. [...] In addition to the CIFAR and SVHN experiments, we used MNIST (LeCun et al., 1998) with ResNet8."
Dataset Splits: Yes. "We used validation set accuracy (10% of the training data; a randomly chosen stratified split) to select the models when optimizing the marginal gain. The best snapshot, found using the validation data, was then selected for final testing. When selecting the models for evaluation on OOD data, we first evaluated ensembles on the in-distribution test set (Appendix C.2)."
Hardware Specification: Yes. "All our models in the ensembles were trained for 100 epochs using PyTorch (Paszke et al., 2019), each ensemble on a single NVIDIA V100 GPU."
Software Dependencies: No. "All our models in the ensembles were trained for 100 epochs using PyTorch (Paszke et al., 2019) [...] For the synthetic data experiments, we used scikit-learn (Pedregosa et al., 2011)."
Experiment Setup: Yes. "The main training hyper-parameters were adapted from (Maddox et al., 2019) (see Table C2), with additional modifications inspired by (Malinin & Gales, 2018; Smith & Topin, 2019), which helped to train the CIFAR models to state-of-the-art performance in only 100 epochs. We first employed a warm-up of the learning rate (LR) from a value 10 times lower than the initial LR (LRinit in Table C2) for 5 epochs. Subsequently, after 50% of the training budget, we linearly annealed the LR to the value LR × lrscale until 90% of the training budget was reached, after which we kept the LR constant. All models were trained using stochastic gradient descent with momentum of 0.9 and a total batch size of 128. We employed standard training augmentations: horizontal flipping, reflective padding to 34x34, and random crop to 32x32 pixels."
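The Random Greedy pseudocode referenced in Algorithms 1 and 2 selects ensemble members by their marginal gain on a validation metric. As a rough illustration of plain greedy forward selection only (not the authors' randomized algorithm, whose details are in the paper), with array shapes as assumptions:

```python
import numpy as np

def greedy_ensemble_selection(member_probs, y_val, k):
    """Greedily pick k members whose averaged predictions maximize
    validation accuracy (the marginal-gain criterion, simplified).

    member_probs: array of shape (n_members, n_samples, n_classes)
    y_val: integer labels of shape (n_samples,)
    """
    selected = []
    ensemble_sum = np.zeros_like(member_probs[0])
    for _ in range(k):
        best_i, best_acc = None, -1.0
        for i in range(len(member_probs)):
            if i in selected:
                continue
            # Accuracy of the ensemble if member i were added.
            avg = (ensemble_sum + member_probs[i]) / (len(selected) + 1)
            acc = (avg.argmax(axis=1) == y_val).mean()
            if acc > best_acc:
                best_i, best_acc = i, acc
        selected.append(best_i)
        ensemble_sum += member_probs[best_i]
    return selected
```

The authors' Algorithm 1 additionally randomizes the candidate pool at each step, which is what makes the O(k) training variant of Algorithm 2 possible.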
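The synthetic OOD inputs listed under Open Datasets (Bernoulli, Gaussian, and uniform noise images) are straightforward to generate; a minimal NumPy sketch, where the batch size, 32x32 RGB shape, and [0, 1] value range are assumptions matched to CIFAR-sized inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (16, 32, 32, 3)  # batch of CIFAR-sized RGB images (assumed)

# Bernoulli noise: each pixel is 0 or 1 with probability 0.5.
bernoulli = rng.binomial(1, 0.5, size=shape).astype(np.float32)

# Gaussian noise: clipped to the valid [0, 1] pixel range.
gaussian = np.clip(rng.normal(0.5, 0.25, size=shape), 0.0, 1.0)

# Uniform noise: pixels drawn uniformly from [0, 1).
uniform = rng.uniform(0.0, 1.0, size=shape).astype(np.float32)
```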
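The stratified 10% validation split described under Dataset Splits can be reproduced with scikit-learn (which the paper already uses for its synthetic-data experiments); a minimal sketch, with the toy labels and variable names as assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for training labels; in practice these come from e.g. CIFAR10.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.10,   # 10% of the training data held out for validation
    stratify=y,       # preserve class proportions in both splits
    random_state=0,   # a fixed seed for a reproducible split
)
```

`stratify=y` is what makes the split stratified: each class appears in the validation set in the same proportion as in the full training set.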
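The learning-rate schedule from the Experiment Setup row (warm-up from LR/10 for 5 epochs, plateau until 50% of the budget, linear anneal to LR × lrscale until 90%, then constant) can be sketched as a per-epoch function; the default values and the exact interpolation endpoints are assumptions:

```python
def lr_at_epoch(epoch, total_epochs=100, lr_init=0.1, lr_scale=0.1, warmup_epochs=5):
    """Piecewise schedule: warm-up, plateau, linear anneal, constant tail."""
    if epoch < warmup_epochs:
        # Linear warm-up from lr_init / 10 up to lr_init.
        frac = epoch / warmup_epochs
        return lr_init / 10 + frac * (lr_init - lr_init / 10)
    anneal_start = 0.5 * total_epochs
    anneal_end = 0.9 * total_epochs
    if epoch < anneal_start:
        return lr_init  # plateau at the initial LR
    if epoch < anneal_end:
        # Linear anneal from lr_init down to lr_init * lr_scale.
        frac = (epoch - anneal_start) / (anneal_end - anneal_start)
        return lr_init + frac * (lr_init * lr_scale - lr_init)
    return lr_init * lr_scale  # constant for the last 10% of training
```

In a training loop this would be applied per epoch, e.g. by setting the optimizer's LR before each epoch or by wrapping it in PyTorch's `LambdaLR` as a multiplier.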