Microcanonical Langevin Ensembles: Advancing the Sampling of Bayesian Neural Networks

Authors: Emanuel Sommer, Jakob Robnik, Giorgi Nozadze, Uros Seljak, David Rügamer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the feasibility of applying MILE to sampling-based inference for BNNs. Datasets and models: We replicate the benchmark by Sommer et al. (2024) but also extend it to other datasets (Ionosphere, Income, IMDB, MNIST, F-MNIST) and models (convolutional and attention-based neural networks). Methods: As in Sommer et al. (2024), we investigate the improvement of our proposed approach over a DE, but also compare against the current state-of-the-art BDE approach based on the NUTS sampler. Runtime comparisons: Following this, we conduct a series of ablation studies, carefully examining how MILE scales compared to BDE. Tuning and hyperparameters: We finally validate the robustness of MILE's hyperparameters, supporting our claim that MILE is an auto-tuned off-the-shelf procedure like NUTS.
Researcher Affiliation | Collaboration | Emanuel Sommer: Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML); Munich, Germany; EMAIL. Jakob Robnik: Physics Department, University of California, Berkeley, USA; EMAIL. Giorgi Nozadze: Department of Statistics, LMU Munich; Eraneos Analytics Germany GmbH; Munich, Germany; EMAIL.
Pseudocode | No | The paper includes Figure 1, a flowchart illustrating the proposed procedure for obtaining a Microcanonical Langevin Ensemble (MILE) for BNNs. The process involves three main stages: optimization, MCLMC warmup and tuning, and MCLMC sampling. These steps are parallelized to generate an ensemble of K members. The number of MCLMC steps for each tuning phase and for the final sampling phase is annotated, and carryovers between stages are highlighted in circles. This is a flowchart, not structured pseudocode or an algorithm block.
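Read as hedged pseudocode, the three stages shown in the flowchart amount to the following sketch; the stage names come from the figure description, while all function names are placeholders rather than the authors' API:

```
for each of the K ensemble members, in parallel:
    params  <- optimize(init_k)               # stage 1: optimization
    state   <- mclmc_warmup_and_tune(params)  # stage 2: MCLMC warmup and tuning
                                              #   (tuned quantities carry over between stages)
    samples <- mclmc_sample(state)            # stage 3: MCLMC sampling
pool the K sample sets into the Microcanonical Langevin Ensemble
```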
Open Source Code | Yes | Our code is available at https://github.com/EmanuelSommer/MILE.
Open Datasets | Yes | Datasets and models: We replicate the benchmark by Sommer et al. (2024) but also extend it to other datasets (Ionosphere, Income, IMDB, MNIST, F-MNIST) and models (convolutional and attention-based neural networks). Table 8: Overview of the used datasets with task description and references.

Abbrev. | Dataset | Task | # Obs. | Feat. | Reference
A | Airfoil | Regression | 1503 | 5 | Dua & Graff (2017)
B | Bikesharing | Regression | 17379 | 13 | Fanaee-T (2013)
C | Concrete | Regression | 1030 | 8 | Yeh (1998)
E | Energy | Regression | 768 | 8 | Tsanas & Xifara (2012)
P | Protein | Regression | 45730 | 9 | Dua & Graff (2017)
Y | Yacht | Regression | 308 | 6 | Ortigosa et al. (2007); Dua & Graff (2017)
| Ionosphere | Binary class. | 351 | 34 | Sigillito et al. (1989)
| Income | Binary class. | 48842 | 14 | Kohavi (1996)
| IMDB | Binary class. | 50000 | text | Maas et al. (2011)
| MNIST | Multi-class. | 60000 | 28x28 | LeCun & Cortes (2010)
| F(ashion)-MNIST | Multi-class. | 60000 | 28x28 | Xiao et al. (2017)
Dataset Splits | Yes | We employ early stopping on a validation set and use a 70% train, 10% validation, and 20% test split if there is no predefined test set, as for the MNIST and Fashion-MNIST datasets.
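As a minimal sketch of the 70/10/20 split described above (the helper name and the fixed seed are illustrative, not taken from the MILE codebase):

```python
import numpy as np

def train_val_test_split(n_obs, seed=0):
    """Shuffle indices and split them 70% train / 10% validation / 20% test.
    Helper name and seed are illustrative, not from the MILE codebase."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_obs)
    n_train = int(0.7 * n_obs)
    n_val = int(0.1 * n_obs)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# e.g. the Airfoil dataset with 1503 observations
train_idx, val_idx, test_idx = train_val_test_split(1503)
print(len(train_idx), len(val_idx), len(test_idx))
```

The remainder after truncation goes to the test set, so the three parts always partition the full index range.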
Hardware Specification | Yes | The experiments were run on two NVIDIA RTX A6000 GPUs and an AMD Ryzen Threadripper PRO 5000WX/3000WX CPU with 64 cores.
Software Dependencies | No | Our software is implemented in Python and mainly relies on the jax (Bradbury et al., 2018) and BlackJAX (Cabezas et al., 2024) libraries. The paper names these libraries but does not provide specific version numbers for them.
Experiment Setup | Yes | For all DE optimizations, we use ADAM with decoupled weight decay (Loshchilov & Hutter, 2019) and the negative log-likelihood as the loss objective. We employ early stopping on a validation set and use a 70% train, 10% validation, and 20% test split if there is no predefined test set, as for the MNIST and Fashion-MNIST datasets. If not specified otherwise, we use 12 DE members and 12 chains. For all NUTS-based experiments, we use a burn-in of 100 samples and collect 1000 posterior samples with a target acceptance rate of 0.8. We also employ an isotropic standard Gaussian prior if not specified otherwise. For the larger CNN and ATT models, we instead choose the isotropic Gaussians N(0, 0.1I) (CNNv2), N(0, 0.2I) (ATTv1, v2), and N(0, 0.4I) (ATTv3).