Variational Best-of-N Alignment
Authors: Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures. |
| Researcher Affiliation | Academia | Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell (ETH Zürich) |
| Pseudocode | Yes | Algorithm 1: The vBoN algorithm |
| Open Source Code | Yes | https://github.com/rycolab/vbon |
| Open Datasets | Yes | The reference model, πref, is GPT-IMDB, a GPT-2 (Radford et al., 2019) model fine-tuned on the IMDB corpus (Maas et al., 2011). We use a binary sentiment classifier, denoted p, with two classes {POS, NEG} as the reward model, and define r(y) := p(POS | y). Following Rafailov et al. (2023), we sample 5000 movie reviews from the training set of the IMDB dataset and, for each sample, randomly choose a prefix length from {2, ..., 8} and take that prefix as the prompt. |
| Dataset Splits | Yes | We sample 5000 movie reviews from the training set of the IMDB dataset and, for each sample, randomly choose a prefix length from {2, ..., 8} and take that prefix as the prompt. We further generate 512 prompts in the same way from the test set of IMDB, which we use to evaluate our models. |
| Hardware Specification | Yes | Figure 4: The average reward and win rate of the aligned models improve as we increase the sample size M used for approximating the vBoN loss function. Performance on a single A100-40GB GPU. |
| Software Dependencies | No | We use the default hyperparameters in the trlx library (Havrilla et al., 2023) for fine-tuning with PPO. We implement and compare the following existing methods for language model alignment. BoN-SFT: perhaps the most straightforward way to approximate the BoN distribution is to fine-tune the model to maximize the likelihood of samples taken with the BoN algorithm; unfortunately, we find that SFT is incapable of achieving a good trade-off between high reward and low KL divergence (see App. H, Fig. 7, for the experimental results). PPO: we use PPO to optimize the KL-constrained objective in Eq. (1). |
| Experiment Setup | Yes | Hyperparameter values: Episodes 10000; Optimizer AdamW (ϵ = 1e-5, lr = 3e-6); Scheduler Linear; Batch Size 32; β (both for vBoN and the KL-constrained RL objective) 0.05; γ (discount factor) 1; λ (for GAE) 0.95; PPO update iterations per epoch 4; PPO policy clipping coefficient 0.2; Value clipping coefficient 0.2; Value function coefficient 0.2; Value function loss clipping True; Sampling temperature 0.7 |
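The Best-of-N policy that vBoN approximates is simple to state: draw N candidates from the reference model and keep the one with the highest reward. The sketch below illustrates this; `sample_fn` and `reward_fn` are toy stand-ins (assumptions for illustration), not the paper's GPT-IMDB model or its binary sentiment classifier.

```python
import itertools

def best_of_n(prompt, sample_fn, reward_fn, n):
    """Draw n candidate continuations from the reference policy and
    return the highest-reward one (the Best-of-N policy)."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins for the paper's components (not the actual models):
_words = itertools.cycle(["awful", "fine", "great"])

def sample_fn(prompt):
    # Deterministic placeholder sampler: appends the next cycled word.
    return prompt + " " + next(_words)

def reward_fn(y):
    # In the paper, r(y) = p(POS | y) from a sentiment classifier;
    # here, a toy positivity score keyed on the final word.
    return {"awful": 0.1, "fine": 0.5, "great": 0.9}[y.rsplit(" ", 1)[1]]

best = best_of_n("The movie was", sample_fn, reward_fn, n=3)
# best is the candidate ending in "great", the highest-reward draw
```

Since BoN requires N forward passes at inference time, the paper instead fine-tunes a model to mimic this policy, which is where the variational objective comes in.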
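The table's Hardware and Pseudocode rows reference the M-sample approximation of the vBoN loss, which rests on a Monte Carlo estimate of F, the CDF of the reward under the reference model. Below is a minimal sketch of that estimate under our reading of the paper; the function names and the example numbers are illustrative assumptions, not the repository's code.

```python
import bisect
import math

def empirical_cdf(reward_samples):
    """Monte Carlo estimate of F, the CDF of the reward under the
    reference model, built from M reference-model samples."""
    sorted_r = sorted(reward_samples)
    m = len(sorted_r)
    def F(r):
        # Fraction of reference-sample rewards <= r.
        return bisect.bisect_right(sorted_r, r) / m
    return F

def vbon_reward(r_y, F, n):
    """Shaped reward (N - 1) * log F(r(y)) that stands in for the raw
    reward in the KL-constrained objective (sketch; clamped away from
    log 0 for numerical safety)."""
    return (n - 1) * math.log(max(F(r_y), 1e-12))

# Toy rewards of M = 5 reference samples (illustrative numbers only).
ref_rewards = [0.1, 0.4, 0.5, 0.7, 0.9]
F = empirical_cdf(ref_rewards)
```

This also makes the trend in Figure 4 intuitive: a larger M gives a finer-grained estimate of F, so the shaped reward (and hence the aligned model) improves.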