BOND: Aligning LLMs with Best-of-N Distillation
Authors: Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges, Johan Ferret, Nino Vieillard, Alexandre Rame, Bobak Shahriari, Sarah Perrin, Abram Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos Garea, Amélie Héliou, Aliaksei Severyn, Matthew Hoffman, Nikola Momchev, Olivier Bachem
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. |
| Researcher Affiliation | Industry | Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abram L. Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem (Google DeepMind) |
| Pseudocode | Yes | Algorithm 1: Iterative BOND (meta-algorithm). Inputs: π_ref, n ∈ ℕ. Initialize π_0 = π_ref, π_0^anchor = π_ref. For t = 0, 1, …: π_{t+1} = argmin_{π ∈ Π} D(π ∥ Best-of-n(π_t^anchor)) (distill the Best-of-n version of π_t^anchor), then set π_{t+1}^anchor = π_{t+1}. Algorithm 2: The J-BOND algorithm. Inputs: prompt dataset D, reference policy π_ref, reward r(·), β, η ∈ [0, 1], γ ≥ 0. Initialize policy and anchor π_0 = π_0^anchor = π_ref. For t = 0, 1, … |
| Open Source Code | No | The paper does not provide explicit statements about source code availability, a repository link, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | Experiments. We first demonstrate the effectiveness of BOND and of our design choices on the abstractive summarization XSum (Narayan et al., 2018) task. Then, in Section 6, we apply J-BOND to align Gemma (Gemma Team, 2024) policies. Zero-shot performance on popular benchmarks including: GPQA (Rein et al., 2024), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) and Big Bench Hard (BBH) (Suzgun et al., 2022) to test our policies on different capabilities. |
| Dataset Splits | No | The paper mentions using a 'batch size of 128' and a 'held-out collection of prompts' for evaluation, but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages or sample counts). While it refers to standard benchmarks, it does not explicitly state the splits used within the main text. |
| Hardware Specification | No | The paper mentions using Gemma models (Gemma Team, 2024) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud infrastructure specifications) used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2015)' with specific learning rate and warm-up steps. However, it does not provide version numbers for any specific software libraries, frameworks (like PyTorch or TensorFlow), or other dependencies that would be needed to replicate the experiment. |
| Experiment Setup | Yes | We use a batch size of 128 and the Adam optimizer (Kingma & Ba, 2015) with learning rate 3e-6 and 100 warm-up steps. For the Jeffreys divergence objective, we set β = 0.5 (we ablate different Jeffreys divergences in Appendix B.3). For J-BOND we set the anchor mixing coefficient to η = 0.02. For REINFORCE, we test possible regularization strengths β_RL ∈ {0.001, 0.01, 0.1, 1}. |
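The Best-of-n operation that the pseudocode row distills against can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: `best_of_n`, `toy_policy`, and `toy_reward` are hypothetical stand-ins for a language-model policy and a reward model.

```python
import random

def best_of_n(policy, reward_fn, prompt, n):
    """Draw n candidate responses from the policy for one prompt and
    return the candidate with the highest reward (Best-of-n sampling)."""
    candidates = [policy(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins: the "policy" emits a random integer and the "reward"
# simply prefers larger values. In BOND these would be an LLM and a
# learned reward model; iterative BOND then trains the policy to match
# the distribution of best_of_n outputs from a slowly moving anchor.
random.seed(0)
toy_policy = lambda prompt: random.randint(0, 100)
toy_reward = lambda response: response

best = best_of_n(toy_policy, toy_reward, prompt="summarize: ...", n=16)
print(best)  # the highest-reward sample among 16 draws
```

Note that larger n sharpens the sampled distribution toward high-reward responses, which is exactly what the distillation objective D(π ∥ Best-of-n(π_anchor)) asks the trained policy to imitate without paying the n-fold inference cost at deployment.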