Uncertainty-Aware Decoding with Minimum Bayes Risk

Authors: Nico Daheim, Clara Meister, Thomas Möllenhoff, Iryna Gurevych

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we demonstrate empirically that incorporating weight uncertainty can improve decoding. First, we provide brief experimental details and discuss how we learn weight uncertainty in Sections 4.1 and 4.2; more details about our experiments are found in App. A. Then, we show results using prompted, finetuned, and from-scratch-trained models in Section 4.3, where we explore different posteriors and model combination methods. Section 4.4 looks into the trade-off between performance and ensemble diversity, and Section 4.5 uses the Bayes risk for selective prediction. Finally, we show the scaling behavior of various methods in Section 4.6.
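The quoted passage is about Minimum Bayes Risk (MBR) decoding, the method named in the paper's title. As background, MBR selects the candidate whose expected utility against the other sampled hypotheses is highest. The following is a minimal sketch in plain Python; the word-overlap utility is a toy stand-in for the BLEU/BERTScore-style utilities typically used, not the paper's actual configuration:

```python
from collections import Counter

def overlap_utility(hyp: str, ref: str) -> float:
    """Toy utility: word-level F1 overlap (stand-in for BLEU/BERTScore)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    p, rec = common / sum(h.values()), common / sum(r.values())
    return 2 * p * rec / (p + rec)

def mbr_decode(candidates, utility=overlap_utility):
    """Pick the candidate with the highest average utility w.r.t. all
    samples, which act as pseudo-references (a Monte Carlo estimate of
    the expected utility, i.e. negative Bayes risk)."""
    def expected_utility(c):
        return sum(utility(c, r) for r in candidates) / len(candidates)
    return max(candidates, key=expected_utility)

samples = ["the cat sat on the mat",
           "a cat sat on the mat",
           "the dog ran away"]
print(mbr_decode(samples))  # a consensus-like candidate; the outlier loses
```

The key design point is that no reference is needed at decode time: the model's own samples approximate the distribution over translations, so consensus outputs win over outliers.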
Researcher Affiliation | Academia | 1 Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; 2 ETH Zurich; 3 RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Pseudocode | No | The paper discusses various algorithms and methods, referring to equations (e.g., Eq. 9, Eq. 10, Eq. 13), but does not provide structured pseudocode blocks or algorithms.
Open Source Code | Yes | We release our code publicly: https://github.com/UKPLab/iclr2025-mbr-uncertainty
Open Datasets | Yes | Datasets. We use WMT14 (Bojar et al., 2014), IWSLT14 (Cettolo et al., 2014), AfroMT (Reid et al., 2021), IWSLT17 (Cettolo et al., 2017), WMT18 (Bojar et al., 2018), and WMT19 (Barrault et al., 2019) for machine translation, XSUM (Narayan et al., 2018) and SAMSum (Gliwa et al., 2019) for summarization, E2E-NLG (Novikova et al., 2017) for data-to-text generation, and STS-B (Cer et al., 2017) for scoring. For the latter, the model outputs a string representation of its numerical prediction, and MBR corresponds to an empirical mean of the numerical predictions (Lukasik et al., 2024). All data usages can be reproduced by following the instructions from the fairseq repository under https://github.com/facebookresearch/fairseq/tree/main/examples/translation and will be published along with our code. For all datasets we use the versions from the huggingface hub (https://huggingface.co/).
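For the STS-B scoring case, the row above notes that MBR reduces to an empirical mean of the sampled numerical predictions (Lukasik et al., 2024), since the mean is the Bayes-optimal prediction under squared-error loss. A minimal sketch, assuming the model emits score strings that may occasionally fail to parse (the skip-on-failure handling is an assumption here, not necessarily the paper's):

```python
def mbr_score(samples):
    """MBR for regression-style string outputs under squared-error loss:
    the Bayes-optimal prediction is the mean of the sampled scores."""
    values = []
    for s in samples:
        try:
            values.append(float(s.strip()))
        except ValueError:
            continue  # skip malformed model output (illustrative choice)
    if not values:
        raise ValueError("no parseable predictions")
    return sum(values) / len(values)

print(mbr_score(["3.5", "4.0", " 3.0 ", "n/a"]))  # 3.5
```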
Dataset Splits | Yes | Our usage of the WMT14 English-to-German translation task (Bojar et al., 2014) follows the set-up of Vaswani et al. (2017) but augments the training data with the news-commentary-v12 data from WMT17 (Bojar et al., 2017). In total, we train on ca. 3.9M paired examples. We also use a validation set of ca. 39.4K examples during training in order to pick checkpoints. We use the original newstest2014 data, which consists of 3,003 examples, for evaluation. We also use the IWSLT14 German-to-English translation task (Cettolo et al., 2014), which consists of ca. 160K training examples; the validation set consists of ca. 7.3K examples and the test set of 6,750 examples. Furthermore, we use two language pairs from AfroMT (Reid et al., 2021): En-Bem (English-Bemba), which consists of 275K training, 3K validation, and 3K test examples. We do not use any monolingual data but only train from scratch on the parallel data. We use En-Run (English-Rundi) in the same way, which consists of 253K training, 3K validation, and 3K test examples. We use the En-De split of the IWSLT17 evaluation campaign (https://huggingface.co/datasets/IWSLT/iwslt2017) (Cettolo et al., 2017) with 206,122 training and 8,079 test examples, and the WMT18 Tr-En split (https://huggingface.co/datasets/wmt/wmt18) (Bojar et al., 2018) with 205,756 training and 3,000 test examples for machine translation. For summarization experiments, we use XSUM (https://huggingface.co/datasets/EdinburghNLP/xsum) (Narayan et al., 2018) and SAMSum (https://huggingface.co/datasets/Samsung/samsum) (Gliwa et al., 2019). XSUM has 204,045 training examples (we train only on the first 50% to reduce computational load) and 11,334 test examples. SAMSum is much smaller and consists of only 14,732 train and 819 test examples. Finally, we use E2E-NLG (https://huggingface.co/datasets/tuetschek/e2e_nlg) (Novikova et al., 2017) with 33,524 train and 1,846 test examples for data-to-text generation, as well as STS-B (https://huggingface.co/datasets/sentence-transformers/stsb) (Cer et al., 2017) with 5,749 train and 1,379 test examples for sentence similarity scoring.
Hardware Specification | Yes | All results were obtained on NVIDIA GeForce RTX 3090 GPUs with 24 GB memory.
Software Dependencies | No | We train all models from scratch using the fairseq library (Ott et al., 2019), which we extend for variational learning and a Bayesian interpretation of neural networks. We use the chat template provided with huggingface (Wolf et al., 2020), which we adapt, in line with the Apache 2.0 license it is distributed under, to organize training and decoding.
Experiment Setup | Yes | We use the variational learning algorithm IVON (Shen et al., 2024) to estimate a posterior distribution over model weights and thereby model weight uncertainty. We use IVON with an isotropic Gaussian prior and initialize all entries of the Hessian with 0.1. We use an effective sample size of 1×10^8, a small weight decay of 0.0001, and a learning rate of 0.1. We set β1 = 0.9 and β2 = 0.9999. All models are trained with a batch size of 32 or up to 1,024 tokens, and we use 2 MC samples from the posterior during training for AfroMT and IWSLT14. For WMT14 we use just one MC sample due to the heavier compute requirements. We clip gradients element-wise at 0.001 and use a dropout rate of 0.2. We train the models until performance in terms of BLEU has not improved for at least 3 epochs and then stop, with the exception of WMT14, where we train only up to 20 epochs. Following prior work, we use a length penalty of 0.6 for decoding (Vaswani et al., 2017). We finetune the model using LoRA (Hu et al., 2022) with rank r = 8, α = 32, and a dropout rate of 0.1. We use an initial learning rate of 0.03, which we anneal to 0 with a cosine decay. We set β1 = 0.9, β2 = 0.99999, and use a small weight decay of 10^-6. We again clip gradients to unit norm and element-wise with a maximum value of 0.001. All Hessian values are initialized at 0.0003. We set the effective sample size (or inverse temperature) to 10^7 for training but 10^9 for decoding, because we have found this to perform better empirically, potentially due to the cold posterior effect (Wenzel et al., 2020). For training with AdamW, we set (β1, β2) = (0.9, 0.999) and perform a sweep over learning rates {1×10^-5, 1×10^-4, 5×10^-4}. We again anneal the learning rates to 0, set a small weight decay of 10^-6, and rescale gradients to unit norm but do not clip them element-wise. We train for 1 epoch for IWSLT17 and XSUM, 5 epochs for E2E-NLG, 2 epochs for WMT18, and 4 epochs for SAMSum. We always take the final checkpoints after training has ended.