Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks

Authors: Emanuel Sommer, Jakob Robnik, Giorgi Nozadze, Uros Seljak, David Rügamer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the feasibility of applying MILE to sampling-based inference for BNNs. Datasets and models: We replicate the benchmark by Sommer et al. (2024) but also extend it to other datasets (Ionosphere, Income, IMDB, MNIST, F-MNIST) and models (convolutional and attention-based neural networks). Methods: As in Sommer et al. (2024), we investigate the improvement of our proposed approach over a DE, but also compare against the current state-of-the-art BDE approach based on the NUTS sampler. Runtime comparisons: Following this, we conduct a series of ablation studies, carefully examining how MILE scales compared to BDE. Tuning and hyperparameters: We finally validate the robustness of MILE's hyperparameters, supporting our claim that MILE is an auto-tuned off-the-shelf procedure like NUTS.
Researcher Affiliation | Collaboration | Emanuel Sommer: Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML); Munich, Germany; EMAIL. Jakob Robnik: Physics Department, University of California, Berkeley, USA; EMAIL. Giorgi Nozadze: Department of Statistics, LMU Munich; Eraneos Analytics Germany GmbH; Munich, Germany; EMAIL.
Pseudocode | No | The paper includes Figure 1, a flowchart illustrating the proposed procedure for obtaining a Microcanonical Langevin Ensemble (MILE) for BNNs. The process involves three main stages: optimization, MCLMC warmup and tuning, and MCLMC sampling. These steps are parallelized to generate an ensemble of K members. The number of MCLMC steps for each tuning phase and for the final sampling phase is annotated, and carryovers between stages are highlighted in circles. This is a flowchart, not structured pseudocode or an algorithm block.
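Read as hedged pseudocode, the three stages shown in the flowchart amount to the following sketch; the stage names come from the figure description, while all function names are placeholders rather than the authors' API:

```
for each of the K ensemble members, in parallel:
    params  <- optimize(init_k)               # stage 1: optimization
    state   <- mclmc_warmup_and_tune(params)  # stage 2: MCLMC warmup and tuning
                                              #   (tuned quantities carry over between stages)
    samples <- mclmc_sample(state)            # stage 3: MCLMC sampling
pool the K sample sets into the Microcanonical Langevin Ensemble
```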
Open Source Code | Yes | Our code is available at https://github.com/EmanuelSommer/MILE.
Open Datasets | Yes | Datasets and models: We replicate the benchmark by Sommer et al. (2024) but also extend it to other datasets (Ionosphere, Income, IMDB, MNIST, F-MNIST) and models (convolutional and attention-based neural networks). Table 8: Overview of the used datasets with task description and references.

Abbrev. | Dataset | Task | # Obs. | Feat. | Reference
A | Airfoil | Regression | 1503 | 5 | Dua & Graff (2017)
B | Bikesharing | Regression | 17379 | 13 | Fanaee-T (2013)
C | Concrete | Regression | 1030 | 8 | Yeh (1998)
E | Energy | Regression | 768 | 8 | Tsanas & Xifara (2012)
P | Protein | Regression | 45730 | 9 | Dua & Graff (2017)
Y | Yacht | Regression | 308 | 6 | Ortigosa et al. (2007); Dua & Graff (2017)
| Ionosphere | Binary class. | 351 | 34 | Sigillito et al. (1989)
| Income | Binary class. | 48842 | 14 | Kohavi (1996)
| IMDB | Binary class. | 50000 | text | Maas et al. (2011)
| MNIST | Multi-class. | 60000 | 28x28 | LeCun & Cortes (2010)
| F(ashion)-MNIST | Multi-class. | 60000 | 28x28 | Xiao et al. (2017)
Dataset Splits | Yes | We employ early stopping on a validation set and use a 70% train, 10% validation, and 20% test split if there is no predefined test set, as for the MNIST and Fashion-MNIST datasets.
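As a minimal sketch of the 70/10/20 split described above (the helper name and the fixed seed are illustrative, not taken from the MILE codebase):

```python
import numpy as np

def train_val_test_split(n_obs, seed=0):
    """Shuffle indices and split them 70% train / 10% validation / 20% test.
    Helper name and seed are illustrative, not from the MILE codebase."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_obs)
    n_train = int(0.7 * n_obs)
    n_val = int(0.1 * n_obs)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# e.g. the Airfoil dataset with 1503 observations
train_idx, val_idx, test_idx = train_val_test_split(1503)
print(len(train_idx), len(val_idx), len(test_idx))
```

The remainder after truncation goes to the test set, so the three parts always partition the full index range.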
Hardware Specification | Yes | The experiments were run on two NVIDIA RTX A6000 GPUs and an AMD Ryzen Threadripper PRO 5000WX/3000WX CPU with 64 cores.
Software Dependencies | No | Our software is implemented in Python and mainly relies on the jax (Bradbury et al., 2018) and BlackJAX (Cabezas et al., 2024) libraries. The paper names these libraries but does not provide specific version numbers for them.
Experiment Setup | Yes | For all DE optimizations, we use ADAM with decoupled weight decay (Loshchilov & Hutter, 2019) and the negative log-likelihood as the loss objective. We employ early stopping on a validation set and use a 70% train, 10% validation, and 20% test split if there is no predefined test set, as for the MNIST and Fashion-MNIST datasets. If not specified otherwise, we use 12 DE members and 12 chains. For all NUTS-based experiments, we use a burn-in of 100 samples and collect 1000 posterior samples with a target acceptance rate of 0.8. We also employ an isotropic standard Gaussian prior if not specified otherwise. For the larger CNN and ATT models, we instead choose the isotropic Gaussians N(0, 0.1I) (CNNv2), N(0, 0.2I) (ATTv1, v2), and N(0, 0.4I) (ATTv3).