Theoretical guarantees on the best-of-n alignment policy
Authors: Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D’Amour, Jacob Eisenstein, Chirag Nagpal, Ananda Theertha Suresh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-n alignment policy, which demonstrate that very good tradeoffs are achievable with n < 1000. In what follows we numerically inspect the proposed estimator in a few scenarios, and compare it with the analytical formula and the exact KL divergence between the best-of-n policy and the reference policy. The first set of examples, in Figure 2, are uniform distributions over alphabets of varying sizes. |
| Researcher Affiliation | Industry | 1Google DeepMind 2Google Research. Correspondence to: Ahmad Beirami <EMAIL>, Ananda Theertha Suresh <EMAIL>. |
| Pseudocode | No | The paper describes mathematical derivations and theoretical bounds, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository for the methodology described. |
| Open Datasets | Yes | In Figure 4, we compare the estimates for four cherry picked examples from the Alpaca dataset (Taori et al., 2023) using Gemma 9B IT model (Gemma et al., 2024)... |
| Dataset Splits | No | The paper mentions using "four cherry picked examples from the Alpaca dataset", which implies specific instances were selected, rather than defining formal training, validation, or test splits for reproducibility of data partitioning. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only refers to a model: "using Gemma 9B IT model (Gemma et al., 2024)". |
| Software Dependencies | No | The paper mentions using the "Gemma 9B IT model (Gemma et al., 2024)" but does not specify any software libraries, frameworks, or their version numbers that would be necessary to replicate the experiments. |
| Experiment Setup | No | The paper mentions experimental conditions such as "with reward the log-likelihood of response under the reference model" and "with temperature one" for the Gemma model, or "reward the negative of length". However, it does not provide hyperparameters or system-level training settings for a model that was trained as part of this research, as the work is primarily theoretical with numerical evaluations. |
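The numerical comparison quoted in the Research Type row (analytical formula vs. exact KL divergence of the best-of-n policy over uniform alphabets, as in the paper's Figure 2) can be reproduced in a few lines. The sketch below is illustrative, not the authors' code: it assumes a uniform reference policy over `k` outcomes with distinct rewards, computes the exact best-of-n distribution and its KL divergence to the reference, and compares against the commonly used closed-form expression log(n) − (n−1)/n, which the paper shows is an upper bound on the true KL divergence.

```python
import math

def best_of_n_pmf(k, n):
    """Exact best-of-n distribution when the reference policy is uniform
    over k outcomes with distinct rewards (outcome i has the i-th
    smallest reward, so ties never occur)."""
    # P(best of n i.i.d. uniform draws is outcome i) = (i/k)^n - ((i-1)/k)^n
    return [(i / k) ** n - ((i - 1) / k) ** n for i in range(1, k + 1)]

def kl_vs_uniform(pmf):
    """KL(pi_bon || pi_ref) when pi_ref is uniform over len(pmf) outcomes."""
    k = len(pmf)
    return sum(p * math.log(p * k) for p in pmf if p > 0)

def analytical_formula(n):
    """Closed-form expression log(n) - (n-1)/n, an upper bound on the KL."""
    return math.log(n) - (n - 1) / n

# Small alphabet: the bound is loose (k = 2, n = 2).
small = kl_vs_uniform(best_of_n_pmf(2, 2))
# Large alphabet: the exact KL approaches the analytical formula.
large = kl_vs_uniform(best_of_n_pmf(1000, 4))
print(f"k=2,    n=2: exact = {small:.4f}, bound = {analytical_formula(2):.4f}")
print(f"k=1000, n=4: exact = {large:.4f}, bound = {analytical_formula(4):.4f}")
```

This matches the qualitative finding quoted above: over small alphabets the analytical formula overestimates the true KL divergence, while over large alphabets (approaching the atomless case) it becomes tight.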