Do Bayesian Neural Networks Actually Behave Like Bayesian Models?

Authors: Gábor Pituk, Vik Shirvaikar, Tom Rainforth

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate how well popular approximate inference algorithms for Bayesian Neural Networks (BNNs) respect the theoretical properties of Bayesian belief updating. We find strong evidence on synthetic regression and real-world image classification tasks that common BNN algorithms such as variational inference, Laplace approximation, SWAG, and SGLD fail to update in a consistent manner, forget about old data under sequential updates, and violate the predictive coherence properties that would be expected of Bayesian methods. These observed behaviors imply that care should be taken when treating BNNs as true Bayesian models, particularly when using them beyond static prediction settings, such as for active, continual, or transfer learning.
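The consistency property at stake can be made concrete with exact Bayesian inference in a conjugate model, where sequential and batch updating provably coincide. The following is an illustrative sketch (our own toy example, not from the paper) using a Gaussian mean model with known unit noise:

```python
import numpy as np

def gaussian_update(prior_mean, prior_var, data):
    """Posterior over the mean after observing `data` with unit observation noise."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data))
    return post_mean, post_var

rng = np.random.default_rng(0)
d1, d2 = rng.normal(2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)

# Sequential: the posterior after D1 becomes the prior for D2.
m, v = gaussian_update(0.0, 10.0, d1)
m_seq, v_seq = gaussian_update(m, v, d2)

# Batch: a single update on D1 and D2 together.
m_batch, v_batch = gaussian_update(0.0, 10.0, np.concatenate([d1, d2]))

# Exact Bayesian updating makes these identical.
assert np.isclose(m_seq, m_batch) and np.isclose(v_seq, v_batch)
```

Approximate BNN posteriors generally break this equality, which is precisely the kind of failure the paper measures.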
Researcher Affiliation | Academia | Gábor Pituk, Vik Shirvaikar, and Tom Rainforth; Department of Statistics, University of Oxford, Oxford, UK.
Pseudocode | No | The paper describes algorithms such as Hamiltonian Monte Carlo, Variational Inference, Laplace Approximation, SWAG, and SGLD in Section B, but these are explained in descriptive text and do not appear as structured pseudocode blocks or clearly labeled algorithm sections.
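For readers who want a concrete version of one of these algorithms, here is a minimal SGLD loop (Welling & Teh, 2011) sampling the posterior over a Gaussian mean; the model, step size, and iteration counts are our illustrative choices, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, 200)  # unit-variance likelihood, N(0, 10) prior on the mean

def grad_log_post(theta):
    # grad log p(theta | data) = grad log prior + sum of grad log likelihoods
    return -theta / 10.0 + np.sum(data - theta)

eps = 1e-4          # step size (held constant here; SGLD proper decays it)
theta, samples = 0.0, []
for t in range(5000):
    noise = rng.normal(0.0, np.sqrt(eps))
    theta += 0.5 * eps * grad_log_post(theta) + noise
    if t >= 1000:   # discard burn-in
        samples.append(theta)

# For this conjugate model the analytic posterior mean is available to compare against.
post_mean = np.sum(data) / (len(data) + 1.0 / 10.0)
```

This toy version uses full-batch gradients; the "stochastic" in SGLD comes from minibatch gradient estimates, which the sketch omits for clarity.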
Open Source Code | Yes | We make our fork of their codebase available at github.com/pitukg/bnn_seq_vi/tree/master/bnn_hmc.
Open Datasets | Yes | We find on synthetic regression tasks and the CIFAR and IMDB image and text classification settings of Izmailov et al. (2021b) that BNNs fail to preserve key features of Bayesian inference.
Dataset Splits | Yes | We partition our synthetic regression dataset into N = 5 equal groups based on the x value, and run sequential approximate inference. We use the CIFAR-10 dataset for this experiment, and consider taking two random subsets of 4080 images each: a labeled split (x, y) and an unlabeled split x. We randomly split the training sets into two splits D(1) and D(2).
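The synthetic-regression split protocol described above (N = 5 equal groups ordered by x) can be sketched as follows; the data-generating process and variable names are our own illustrative choices, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 100)
y = np.sin(x) + 0.1 * rng.normal(size=100)

order = np.argsort(x)               # group by x value, not at random
groups = np.array_split(order, 5)   # five equal-sized partitions
splits = [(x[idx], y[idx]) for idx in groups]

# Each group covers a contiguous range of x, so sequential inference
# sees the input space one region at a time.
assert all(len(xs) == 20 for xs, _ in splits)
```

Splitting by x value (rather than uniformly at random) is what makes sequential updating informative here: each new group carries genuinely new information about a previously unseen input region.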
Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU/CPU models, memory details) used for its own experiments. It mentions other researchers' use of "hundreds of Tensor Processing Units" when discussing HMC, but not for the experiments conducted in this paper.
Software Dependencies | Yes | To carry out our experiments we use NumPyro (Phan et al., 2019; Bingham et al., 2019), a probabilistic programming library in Python built on JAX (Bradbury et al., 2018). Our SWAG implementation relies on the Optax SWAG library (activatedgeek, 2023).
Experiment Setup | Yes | We follow the hyper-parameters from Table 4 of Izmailov et al. (2021b). We use two fully connected BNN architectures with hidden layers of size 32, 32, 16, and 128, 256, 128, 64, respectively. We pick β = 0.325 for the small network and β = 0.05 for the larger network... We pick λ = 300 for the small network and λ = 2000 for the large network... ...η = 0.02 for the small network and η = 0.01 for the larger network.
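For anyone reimplementing the two architectures, the settings quoted above can be collected into a plain configuration mapping; the key names are our own, while the values are taken verbatim from the excerpt:

```python
# Hyper-parameters reported for the two fully connected BNN architectures.
# Key names ("small"/"large", "hidden", "beta", "lam", "eta") are our labels;
# the numeric values are quoted from the paper's setup description.
CONFIGS = {
    "small": {"hidden": (32, 32, 16),        "beta": 0.325, "lam": 300,  "eta": 0.02},
    "large": {"hidden": (128, 256, 128, 64), "beta": 0.05,  "lam": 2000, "eta": 0.01},
}
```

The excerpt elides the definitions of β, λ, and η (the "..." gaps), so their roles should be taken from Table 4 of Izmailov et al. (2021b) rather than guessed from this summary.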