Incorporating Unlabelled Data into Bayesian Neural Networks

Authors: Mrinank Sharma, Tom Rainforth, Yee Whye Teh, Vincent Fortuin

TMLR 2024

Reproducibility assessment. Each entry lists the variable, the assessed result, and the supporting response (verbatim quotes from the paper in quotation marks).
Research Type: Experimental. "We empirically demonstrate that the improved prior predictives of self-supervised BNNs translate to improved predictive performance, especially in problem settings with few labelled examples (Section 5). In Table 2, we report the NLL for each BNN when making predictions with different numbers of labelled examples."
Researcher Affiliation: Academia. Mrinank Sharma, University of Oxford, UK; Tom Rainforth, University of Oxford, UK; Yee Whye Teh, University of Oxford, UK; Vincent Fortuin, Helmholtz AI, Munich, Germany and Technical University of Munich, Germany.
Pseudocode: Yes. "Algorithm 1: Self-Supervised BNNs"
Open Source Code: No. The paper references third-party libraries and their code (e.g., Bayesian-Torch; Krishnan et al., 2022), but it provides no explicit statement of, or link to, source code for the specific methodology described in the paper.
Open Datasets: Yes. "We evaluate the performance of different BNNs on the CIFAR10 and CIFAR100 datasets, which are standard benchmarks within the BNN community. To assess out-of-distribution generalisation, we further evaluate on the CIFAR-10-C dataset (Hendrycks & Dietterich, 2018). Moreover, we evaluate whether these BNNs can detect out-of-distribution inputs from SVHN (Netzer et al., 2011) when trained on CIFAR10."
Dataset Splits: Yes. "We evaluate the performance of different baselines when conditioning on 50, 500, 5000, and 50000 labels from the training set. For the evaluation protocols, we reserve a validation set of 1000 data points from the test set and evaluate using the remaining 9000 labels."
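The quoted split protocol can be sketched as follows. This is a minimal illustration only: the function name, the random-number seeding, and the use of uniform random subsampling are assumptions not specified in the quote.

```python
import numpy as np

def make_splits(n_train=50000, n_test=10000, n_labels=500, n_val=1000, seed=0):
    """Sketch of the quoted protocol (seeding/sampling are assumptions):
    subsample `n_labels` training labels, and reserve `n_val` test points
    as a validation set, evaluating on the remaining test points."""
    rng = np.random.default_rng(seed)
    # Condition on a random subset of the training labels (50/500/5000/50000).
    train_idx = rng.choice(n_train, size=n_labels, replace=False)
    # Carve the validation set out of the test set; evaluate on the rest.
    test_perm = rng.permutation(n_test)
    val_idx, eval_idx = test_perm[:n_val], test_perm[n_val:]
    return train_idx, val_idx, eval_idx

train_idx, val_idx, eval_idx = make_splits(n_labels=500)
```

With CIFAR10's 10000 test points, this leaves 9000 points for evaluation, matching the quote.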
Hardware Specification: Yes. "The vast majority of experiments were run on an internal compute cluster using Nvidia Tesla V100 or A100 GPUs."
Software Dependencies: No. The paper mentions software components such as the LARS optimiser, the Adam optimiser, and the laplace library, but does not specify version numbers for any of them.
Experiment Setup: Yes. "We use a N(0, 1/τp) prior over the linear parameters θt, and tune τp for each dataset. We use τp = 0.65 for CIFAR10 and τp = 0.6 for CIFAR100. We use weight decay 1e-6 for the base encoder and projection head parameters. We use the LARS optimiser (You et al., 2017), with batch size 1000 and momentum 0.9. We train for 1000 epochs, using a linear warmup cosine annealing learning rate schedule. The warmup starting learning rate for the base encoder parameters is 1e-3 with a maximum learning rate of 0.6. For the variational parameters, the maximum learning rate is 1e-3."
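The quoted learning-rate schedule (linear warmup into cosine annealing over 1000 epochs, warmup start 1e-3, maximum 0.6) can be sketched as below. The warmup length and the final minimum learning rate are assumptions; the quote specifies only the schedule type and the start/maximum rates.

```python
import math

def lr_at_epoch(epoch, total_epochs=1000, warmup_epochs=10,
                warmup_start_lr=1e-3, max_lr=0.6, min_lr=0.0):
    """Linear warmup followed by cosine annealing, evaluated per epoch.
    `warmup_epochs` and `min_lr` are assumed values, not from the paper."""
    if epoch < warmup_epochs:
        # Linear ramp from the warmup start rate up to the maximum rate.
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (max_lr - warmup_start_lr)
    # Cosine decay from max_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the variational parameters, the same shape would apply with `max_lr=1e-3`, per the quote.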