Variational Best-of-N Alignment

Authors: Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures.
Researcher Affiliation Academia Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell (ETH Zürich)
Pseudocode Yes Algorithm 1 The vBoN algorithm
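The paper's Algorithm 1 (vBoN) is not reproduced in this report; as context for the procedure being approximated, here is a minimal sketch of plain Best-of-N sampling. The `sample` and `reward` callables stand in for the reference policy πref and the reward model r, which are our placeholders, not the paper's implementation:

```python
def best_of_n(sample, reward, n):
    """Best-of-N: draw n candidate generations from the reference
    policy and return the one with the highest reward."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=reward)


# Toy usage with deterministic stand-ins: candidates are plain numbers
# and the reward is the identity, so BoN returns the maximum.
it = iter([3, 1, 5, 2])
best = best_of_n(lambda: next(it), lambda y: y, n=4)  # -> 5
```

vBoN fine-tunes the model to approximate the distribution this procedure induces, avoiding the N-fold inference cost at deployment time.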
Open Source Code Yes https://github.com/rycolab/vbon
Open Datasets Yes The reference model, πref, is GPT-IMDB, a GPT-2 (Radford et al., 2019) model fine-tuned on the IMDB corpus (Maas et al., 2011). We use a binary sentiment classifier, denoted as p, with two classes {POS, NEG} as the reward model, and define r(y) := p(POS | y). Following Rafailov et al. (2023), we sample 5000 movie reviews from the training set of the IMDB dataset and, for each sample, we randomly choose a prefix length from {2, ..., 8} and take that prefix as the prompt.
Dataset Splits Yes We sample 5000 movie reviews from the training set of the IMDB dataset and, for each sample, we randomly choose a prefix length from {2, ..., 8} and take that prefix as the prompt. We further generate 512 prompts in the same way from the test set of IMDB, which we use to evaluate our models.
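The prompt-construction procedure quoted above can be sketched as follows. The function name `make_prompts`, the whitespace tokenization, and treating the prefix length as a word count are our assumptions for illustration; the paper presumably uses the model's own tokenizer:

```python
import random


def make_prompts(reviews, n_prompts, rng):
    """Sample reviews and keep a random 2-8 token prefix of each as
    the prompt, mirroring the paper's IMDB setup. Tokenization here is
    naive whitespace splitting (an assumption, not the paper's code)."""
    prompts = []
    for review in rng.sample(reviews, n_prompts):
        tokens = review.split()
        k = rng.randint(2, 8)  # prefix length drawn uniformly from {2, ..., 8}
        prompts.append(" ".join(tokens[:k]))
    return prompts


# Usage: 5 prompts from a toy corpus of identical reviews.
rng = random.Random(0)
reviews = ["this movie was surprisingly good and the pacing never dragged at all"] * 10
prompts = make_prompts(reviews, 5, rng)
```

The paper applies this once to the training set (5000 prompts) and once to the test set (512 prompts).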
Hardware Specification Yes Figure 4: The average reward and win rate of the aligned models improve as we increase the sample size M used for approximating the vBoN loss function. Performance on a single A100-40GB GPU.
Software Dependencies No We use the default hyperparameters in the trlx library (Havrilla et al., 2023) for fine-tuning with PPO. We implement and compare the following existing methods for language model alignment: BoN-SFT: Perhaps the most straightforward way to approximate the BoN distribution is to fine-tune the model to maximize the likelihood of samples taken with the BoN algorithm. Unfortunately, we find that SFT is incapable of achieving a good trade-off between high reward and low KL divergence; see App. H (Fig. 7) for the experimental results. PPO: We use PPO to optimize the KL-constrained objective in Eq. (1).
Experiment Setup Yes
Hyperparameter: Value
Episodes: 10000
Optimizer: AdamW (ϵ = 1e-5, lr = 3e-6)
Scheduler: Linear
Batch Size: 32
β (both for vBoN and the KL-constrained RL objective): 0.05
γ (discount factor): 1
λ (for GAE): 0.95
Number of PPO update iterations per epoch: 4
PPO's policy clipping coefficient: 0.2
Value clipping coefficient: 0.2
Value function coefficient: 0.2
Value function loss clipping: True
Sampling temperature: 0.7
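For reference, the hyperparameters above can be collected into a plain Python dict. The key names below are ours, chosen for readability; they do not correspond to trlx's actual configuration schema:

```python
# Hedged sketch of the reported fine-tuning configuration; key names
# are illustrative, not trlx config fields.
config = {
    "episodes": 10_000,
    "optimizer": "AdamW",
    "adam_epsilon": 1e-5,
    "learning_rate": 3e-6,
    "scheduler": "linear",
    "batch_size": 32,
    "beta": 0.05,           # for both vBoN and the KL-constrained RL objective
    "gamma": 1.0,           # discount factor
    "gae_lambda": 0.95,
    "ppo_epochs": 4,        # PPO update iterations per epoch
    "policy_clip": 0.2,
    "value_clip": 0.2,
    "vf_coef": 0.2,
    "clip_value_loss": True,
    "temperature": 0.7,
}
```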