Better Instruction-Following Through Minimum Bayes Risk
Authors: Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, Graham Neubig
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on Alpaca Eval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. |
| Researcher Affiliation | Collaboration | Ian Wu (1), Patrick Fernandes (2,3,4), Amanda Bertsch (2), Seungone Kim (2), Sina Pakazad (1), Graham Neubig (2); (1) C3 AI, (2) Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 MBR Inference and Algorithm 2 MBR Distillation with DPO |
| Open Source Code | No | In an effort to make our work reproducible, we document all prompts (Appendices E and B), as well as training and inference hyperparameters (Appendix K) used throughout our experiments. We also include version information for all API-based LLMs (Appendix B), and choose to use open-source models (the Llama-2, Llama-3, Prometheus-2 and JudgeLM families) where possible. |
| Open Datasets | Yes | Alpaca Eval (Li et al., 2023) is an LLM-based evaluation metric. It consists of an 805-sample, highly diverse single-turn instruction-following conversational dataset... MT-Bench (Zheng et al., 2023) is an 80-sample, two-turn instruction-following conversational dataset. ... We use 3000 random samples from Ultra Chat (Ding et al., 2023) for SFT. |
| Dataset Splits | Yes | We use 3000 random samples from Ultra Chat (Ding et al., 2023) for SFT. ... We start by randomly drawing a further 3000 prompts from Ultra Chat (excluding the samples that have already been selected for SFT). |
| Hardware Specification | Yes | For inference, we use 4x A100 GPUs with bf16 quantisation for all LLMs and judge LLMs, other than for the Analysis of compute costs experiments in Section 4.2, where we use 2x A100 GPUs. We use vLLM (Kwon et al., 2023) as the inference engine for all experiments. ... We use bf16 mixed precision training with 8x A100 GPUs for all experiments. |
| Software Dependencies | No | We use vLLM (Kwon et al., 2023) as the inference engine for all experiments. |
| Experiment Setup | Yes | We use the chat and instruct variants of the Llama-2 (Touvron et al., 2023b) and Llama-3 (Dubey et al., 2024) models in this experiment. All models have undergone prior SFT and demonstrate strong instruction-following and conversation abilities. We generate Ncand = 30 candidates using temperature sampling with t = 0.3 for all MBR decoding experiments unless otherwise specified. ... For MBR decoding, we use Ncand = 30 and t = 0.3 with Prometheus-2-7B as the utility metric. ... Appendix K: TRAINING AND INFERENCE HYPERPARAMETERS (Tables 21 and 22 listing specific values for Learning Rate, Num Epochs, Batch Size, Optimiser, etc.) |
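The paper's Algorithm 1 (MBR Inference) is not reproduced in this report, but the core selection rule it refers to is standard: score each of the Ncand sampled candidates by its average utility against the other candidates, treated as pseudo-references, and return the highest-scoring one. A minimal sketch follows; the `overlap` utility is a toy stand-in for the paper's LLM-judge utility (e.g. Prometheus-2-7B), and all function names here are illustrative, not taken from the authors' code.

```python
from typing import Callable, List

def mbr_select(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    """Minimum Bayes Risk selection: return the candidate with the highest
    average utility against all other candidates (used as pseudo-references)."""
    best, best_score = candidates[0], float("-inf")
    for cand in candidates:
        # Average utility of `cand` against every other candidate.
        score = sum(utility(cand, ref) for ref in candidates if ref is not cand)
        score /= max(len(candidates) - 1, 1)
        if score > best_score:
            best, best_score = cand, score
    return best

def overlap(a: str, b: str) -> float:
    """Toy lexical utility (Jaccard token overlap); a real run would call
    an LLM judge that scores `a` using `b` as the reference."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)
```

Note the cost implication the paper's Section 4.2 measures: with a pairwise utility, MBR needs O(Ncand^2) judge calls per prompt, versus O(Ncand) for best-of-N with a reference-free judge.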