Better Instruction-Following Through Minimum Bayes Risk
Authors: Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on Alpaca Eval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. |
| Researcher Affiliation | Collaboration | Ian Wu (1), Patrick Fernandes (2,3,4), Amanda Bertsch (2), Seungone Kim (2), Sina Pakazad (1), Graham Neubig (2); (1) C3 AI, (2) Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 MBR Inference and Algorithm 2 MBR Distillation with DPO |
| Open Source Code | No | In an effort to make our work reproducible, we document all prompts (Appendices E and B), as well as training and inference hyperparameters (Appendix K) used throughout our experiments. We also include version information for all API-based LLMs (Appendix B), and choose to use open-source models (the Llama-2, Llama-3, Prometheus-2 and JudgeLM families) where possible. |
| Open Datasets | Yes | Alpaca Eval (Li et al., 2023) is an LLM-based evaluation metric. It consists of an 805-sample, highly diverse single-turn instruction-following conversational dataset... MT-Bench (Zheng et al., 2023) is an 80-sample, two-turn instruction-following conversational dataset. ... We use 3000 random samples from Ultra Chat (Ding et al., 2023) for SFT. |
| Dataset Splits | Yes | We use 3000 random samples from Ultra Chat (Ding et al., 2023) for SFT. ... We start by randomly drawing a further 3000 prompts from Ultra Chat (excluding the samples that have already been selected for SFT). |
| Hardware Specification | Yes | For inference, we use 4x A100 GPUs with bf16 quantisation for all LLMs and judge LLMs, other than for the Analysis of compute costs experiments in Section 4.2, where we use 2x A100 GPUs. We use vLLM (Kwon et al., 2023) as the inference engine for all experiments. ... We use bf16 mixed precision training with 8x A100 GPUs for all experiments. |
| Software Dependencies | No | We use vLLM (Kwon et al., 2023) as the inference engine for all experiments. |
| Experiment Setup | Yes | We use the chat and instruct variants of the Llama-2 (Touvron et al., 2023b) and Llama-3 (Dubey et al., 2024) models in this experiment. All models have undergone prior SFT and demonstrate strong instruction-following and conversation abilities. We generate Ncand = 30 candidates using temperature sampling with t = 0.3 for all MBR decoding experiments unless otherwise specified. ... For MBR decoding, we use Ncand = 30 and t = 0.3 with Prometheus-2-7B as the utility metric. ... Appendix K: TRAINING AND INFERENCE HYPERPARAMETERS (Tables 21 and 22 listing specific values for Learning Rate, Num Epochs, Batch Size, Optimiser, etc.) |
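The paper's Algorithm 1 (MBR Inference) is not reproduced in this report, but the core selection rule it refers to is standard: score each of the Ncand sampled candidates by its average utility against the other candidates, treated as pseudo-references, and return the highest-scoring one. A minimal sketch follows; the `overlap` utility is a toy stand-in for the paper's LLM-judge utility (e.g. Prometheus-2-7B), and all function names here are illustrative, not taken from the authors' code.

```python
from typing import Callable, List

def mbr_select(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    """Minimum Bayes Risk selection: return the candidate with the highest
    average utility against all other candidates (used as pseudo-references)."""
    best, best_score = candidates[0], float("-inf")
    for cand in candidates:
        # Average utility of `cand` against every other candidate.
        score = sum(utility(cand, ref) for ref in candidates if ref is not cand)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = cand, score
    return best

def overlap(a: str, b: str) -> float:
    """Toy lexical utility (Jaccard token overlap); a real run would call
    an LLM judge that scores `a` using `b` as the reference."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)
```

Note the cost implication the paper's Section 4.2 measures: with a pairwise utility, MBR needs O(Ncand^2) judge calls per prompt, versus O(Ncand) for best-of-N with a reference-free judge.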