Scalable Bayesian Learning with posteriors
Authors: Samuel Duffield, Kaelan Donatella, Johnathan Chiu, Phoebe Klett, Daniel Simpson
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we (i) introduce posteriors, an easily extensible PyTorch library hosting general-purpose implementations making Bayesian learning accessible and scalable to large data and parameter regimes; (ii) present a tempered framing of stochastic gradient Markov chain Monte Carlo, as implemented in posteriors, that transitions seamlessly into optimization and unveils a minor modification to deep ensembles to ensure they are asymptotically unbiased for the Bayesian posterior; and (iii) demonstrate and compare the utility of Bayesian approximations through experiments including an investigation into the cold posterior effect and applications with large language models. |
| Researcher Affiliation | Industry | Samuel Duffield Normal Computing Kaelan Donatella Normal Computing Johnathan Chiu Normal Computing Phoebe Klett Normal Computing Daniel Simpson Normal Computing |
| Pseudocode | Yes | Figure 3: posteriors code snippet to train a classifier with variational inference. posteriors recommends normalising the log posterior across the batch so its scale is independent of batch size or N. Scaling the prior and temperature by N⁻¹ ensures the posterior is still correctly targeted. |
| Open Source Code | Yes | posteriors: github.com/normal-computing/posteriors |
| Open Datasets | Yes | Following Wenzel et al. (2020); Izmailov et al. (2021), we train a CNN-LSTM model with 2.7 million parameters on the IMDB dataset (Maas et al., 2011) for binary classification of positive or negative reviews. Our dataset (Rae et al., 2019) for the continual learning experiment, a collection of long books, is divided into N episodes of train and test data. We fine-tune the last attention layer of the 8B Llama 3 model (AI@Meta, 2024) (resulting in 218 million trainable parameters) on the TQA dataset (Kembhavi et al., 2017), which consists of scientific textbooks. |
| Dataset Splits | Yes | We train on the IMDB dataset (Maas et al., 2011) for binary classification of positive/negative sentiment. We follow the default 50-50 split for the dataset, with 25 thousand samples for training and 25 thousand samples for testing. Our dataset (Rae et al., 2019) for the continual learning experiment, a collection of long books, is divided into N episodes of train and test data. In the results reported in Figures 5 and 6, we use 1 book per episode, holding out the last 15%. |
| Hardware Specification | Yes | All cold posterior simulations were run on an NVIDIA A100, and all simulations (including repeats over 5 random seeds) take 1 day to run. Continual LoRA simulations were run on an NVIDIA A100; simulations over the 20 books take 6 hours to run. Bayesian Llama 3 simulations were run on an NVIDIA A100; training of the SGD and ensemble approaches takes 16 hours to run, whilst evaluation over the 100 statements is fast. |
| Software Dependencies | No | posteriors is written in PyTorch (Paszke et al., 2019) and has compatibility with a broad range of tools including the Llama (Touvron et al., 2023) and Mistral (Jiang et al., 2023) models, lightning (Falcon and The PyTorch Lightning team, 2019) for convenient logging and device management, optimizers from torchopt (Ren et al., 2023) and probabilistic programs from pyro (Bingham et al., 2019), see Appendix G. |
| Experiment Setup | Yes | All methods used batch size 32. We train using the AdamW optimizer (Loshchilov and Hutter, 2018) with all hyperparameters set to the defaults from TorchOpt (Ren et al., 2023) (learning rate 10^-3). We use a single-sample Monte Carlo estimate at each step. We train the variational variances in log space to avoid negative variances and initialise all log standard deviations to -3. We train with learning rate 0.1 and friction α = 1. We run for 60 epochs (which is longer, as many samples are collected along a single trajectory). We apply a burn-in of 20,000 iterations and then collect samples every 1000 iterations, resulting in a final collection of 27 samples. We use rank r = 8 and α = 32 for LoRA, the standard settings (Hu et al., 2021), and fine-tune three weight matrices in the last layer (key, query, and output projections), following the literature (Yang et al., 2024). We set the learning rate to be 10^-3. We set the alpha and beta parameters to 10^-2 and 0, respectively, with the momenta initialized to 0. For the textbook content, we tokenize with a stride length of 300 and a stride overlap of size 100 while using a batch size of 10. During training, we mask the loss on the first 250 tokens and only consider the loss on the last 50 tokens. All layers are frozen except for the final attention layer. |
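The batch-normalised log posterior described in the pseudocode row can be sketched in plain PyTorch. This is a minimal illustration of the tempering convention (mean log-likelihood over the batch, prior scaled by 1/N), not the posteriors library's actual API; the toy linear model and function name are our own.

```python
import torch


def normalized_log_posterior(params, batch, N):
    """Log posterior normalised per data point: the value's scale is
    independent of both the batch size and the dataset size N.

    Mean log-likelihood over the batch plus the log prior scaled by 1/N,
    so that N * normalized_log_posterior targets the true posterior.
    """
    x, y = batch
    logits = x @ params                                   # toy linear classifier
    log_lik = torch.distributions.Bernoulli(logits=logits).log_prob(y).mean()
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(params).sum()
    return log_lik + log_prior / N
```

With this convention, swapping batch sizes or dataset sizes leaves the objective on a comparable scale, which is what makes a single temperature setting transfer across experiments.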
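The LoRA settings in the experiment setup (rank r = 8, α = 32, frozen base weights) can be illustrated with a small self-contained module. This is a generic sketch of the low-rank parameterisation of Hu et al. (2021), not the paper's implementation; the class name and initialisation scale are our own choices.

```python
import torch


class LoRALinear(torch.nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update,
    scaled by alpha / r as in Hu et al. (2021)."""

    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        out_features, in_features = base.weight.shape
        self.A = torch.nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_features, r))
        # B is zero-initialised, so the layer starts as an exact no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Wrapping only the key, query, and output projections of the final attention layer in such modules reproduces the trainable-parameter budget described above, with everything else frozen.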