BOND: Aligning LLMs with Best-of-N Distillation
Authors: Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges, Johan Ferret, Nino Vieillard, Alexandre Rame, Bobak Shahriari, Sarah Perrin, Abram Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos Garea, Amélie Héliou, Aliaksei Severyn, Matthew Hoffman, Nikola Momchev, Olivier Bachem
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. |
| Researcher Affiliation | Industry | Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abram L. Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem (Google DeepMind) |
| Pseudocode | Yes | Algorithm 1: Iterative BOND (meta-algorithm). Inputs: π_ref, n ∈ ℕ. Initialize π_0 = π_ref, π_0^anchor = π_ref. For t = 0, 1, …: π_{t+1} = argmin_{π ∈ Π} D(π ∥ Best-of-n(π_t^anchor)) (distill the Best-of-n version of π_t^anchor), then set π_{t+1}^anchor = π_{t+1}. Algorithm 2: The J-BOND algorithm. Inputs: prompt dataset D, reference policy π_ref, reward r(·), β, η ∈ [0, 1], γ ≥ 0. Initialize policy and anchor π_0 = π_0^anchor = π_ref. For t = 0, 1, … |
| Open Source Code | No | The paper does not provide explicit statements about source code availability, a repository link, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | Experiments. We first demonstrate the effectiveness of BOND and of our design choices on the abstractive summarization XSum (Narayan et al., 2018) task. Then, in Section 6, we apply J-BOND to align Gemma (Gemma Team, 2024) policies. Zero-shot performance on popular benchmarks including: GPQA (Rein et al., 2024), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and Big Bench Hard (BBH) (Suzgun et al., 2022) to test our policies on different capabilities. |
| Dataset Splits | No | The paper mentions using a 'batch size of 128' and a 'held-out collection of prompts' for evaluation, but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages or sample counts). While it refers to standard benchmarks, it does not explicitly state the splits used within the main text. |
| Hardware Specification | No | The paper mentions using Gemma models (Gemma Team, 2024) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud infrastructure specifications) used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2015)' with specific learning rate and warm-up steps. However, it does not provide version numbers for any specific software libraries, frameworks (like PyTorch or TensorFlow), or other dependencies that would be needed to replicate the experiment. |
| Experiment Setup | Yes | We use a batch size of 128 and the Adam optimizer (Kingma & Ba, 2015) with learning rate 3e-6 and 100 warm-up steps. For the Jeffreys divergence objective, we set β = 0.5 (we ablate different Jeffreys divergences in Appendix B.3). For J-BOND we set the anchor mixing coefficient to η = 0.02. For REINFORCE, we test possible regularization strengths β_RL ∈ {0.001, 0.01, 0.1, 1}. |
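The Best-of-n operation that the pseudocode row distills against can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: `best_of_n`, `toy_policy`, and `toy_reward` are hypothetical stand-ins for a language-model policy and a reward model.

```python
import random

def best_of_n(policy, reward_fn, prompt, n):
    """Draw n candidate responses from the policy for one prompt and
    return the candidate with the highest reward (Best-of-n sampling)."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins: the "policy" emits a random integer and the "reward"
# simply prefers larger values. In BOND these would be an LLM and a
# learned reward model; iterative BOND then trains the policy to match
# the distribution of best_of_n outputs from a slowly moving anchor.
random.seed(0)
toy_policy = lambda prompt: random.randint(0, 100)
toy_reward = lambda response: response

best = best_of_n(toy_policy, toy_reward, prompt="summarize: ...", n=16)
print(best)  # the highest-reward sample among 16 draws
```

Note that larger n sharpens the sampled distribution toward high-reward responses, which is exactly what the distillation objective D(π ∥ Best-of-n(π_anchor)) asks the trained policy to imitate without paying the n-fold inference cost at deployment.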