InfAlign: Inference-aware language model alignment

Authors: Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we apply InfAlign-CTRL to the Anthropic helpfulness (Bai et al., 2022) and Reddit summarization (Stiennon et al., 2020) datasets for optimizing BoN performance, and to Anthropic harmlessness for optimizing WoN performance, with various values of N. We show that solving InfAlign through InfAlign-CTRL outperforms various SOTA RLHF solvers on inference-time win rate by 3-8%.
Researcher Affiliation | Industry | 1Google DeepMind, 2Google Research. Correspondence to: Ananth Balashankar <EMAIL>, Ziteng Sun <EMAIL>, Jonathan Berant <EMAIL>, Jacob Eisenstein <EMAIL>, Ananda Theertha Suresh <EMAIL>, Ahmad Beirami <EMAIL>.
Pseudocode | Yes | Algorithm 1: Implementation of InfAlign-CTRL. Require: base policy π_ref, (uncalibrated) reward model r, set of training prompts D ⊆ X, number of offline rollouts per prompt K, transformation function Φ. 1: Compute the empirical calibrated reward Ĉ_{r,π_ref} using Eq. (12) with K offline rollouts per x ∈ D. 2: Transform the calibrated reward using Φ to get R_Φ = Φ(Ĉ_{r,π_ref}). 3: Optimize the RLHF objective of Eq. (4) with the calibrated, transformed reward using PPO.
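Steps 1-2 of the algorithm above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `calibrated_reward`, `transform_bon`, and the exponent `t` are hypothetical names/values, and step 3 (PPO optimization of the transformed reward) is omitted.

```python
import math

def calibrated_reward(score, rollout_scores):
    # Empirical calibration in the spirit of Eq. (12): map a response's
    # raw reward to the fraction of K base-policy rollout scores it meets
    # or exceeds, yielding a quantile in [0, 1].
    return sum(s <= score for s in rollout_scores) / len(rollout_scores)

def transform_bon(c, t=4.0):
    # Hypothetical monotone transformation Phi(c) = exp(t * c) applied to
    # the calibrated reward; the paper's actual choice of Phi and its
    # hyperparameters may differ.
    return math.exp(t * c)

# A response scoring above all K = 4 offline rollouts calibrates to 1.0.
c = calibrated_reward(5.0, [1.0, 2.0, 3.0, 4.0])  # -> 1.0
r_phi = transform_bon(c)
```

Calibration makes the reward scale-invariant (only its rank under π_ref matters), which is what lets a single transformation Φ target BoN or WoN inference.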
Open Source Code | No | The paper does not provide an explicit statement or link to an open-source code repository for the described methodology.
Open Datasets | Yes | We consider the following tasks: (1) Anthropic Helpfulness and Harmlessness datasets (Bai et al., 2022), which involve multi-turn dialogues between a human and a digital assistant. For training the reward models, the preference datasets consist of two responses for one context and a label for the human-preferred response. We use the train splits of the two datasets (44K examples for helpfulness and 42K for harmlessness) to train separate uncalibrated and calibrated reward models for each objective. (2) Similarly, for the summarization quality task, we use Reddit posts from the TL;DR dataset (Stiennon et al., 2020) and train uncalibrated and calibrated reward models on its train split.
Dataset Splits | Yes | We use the train splits of the two datasets (44K examples for helpfulness and 42K for harmlessness) to train separate uncalibrated and calibrated reward models for each objective. (2) Similarly, for the summarization quality task, we use Reddit posts from the TL;DR dataset (Stiennon et al., 2020) and train uncalibrated and calibrated reward models on its train split. ... We report win rate on the test split as measured by the PaLM-2 M reward model trained on the corresponding datasets.
Hardware Specification | No | The paper mentions using the 'PaLM-2 S model' and 'PaLM-2 M model', which are language models, but does not specify any hardware details such as GPU/CPU models or types.
Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the optimization algorithm but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We set K = 100 in our experiments and analyze the additional computational overhead in the Appendix. ... For each run, we experiment with different KL-regularizer strengths (β ∈ {0.01, ..., 0.09}) and obtain Pareto curves of KL divergence vs. {standard, BoN, WoN} win rate.
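Since the BoN/WoN win rates above are central to the evaluation, here is a minimal sketch of best-of-N selection at inference time. The helper names (`best_of_n`, `sample`, `reward`) are illustrative, not from the paper's codebase.

```python
def best_of_n(prompt, sample, reward, n):
    # Best-of-N decoding: draw n candidate responses from the policy and
    # return the one scoring highest under the reward model. Worst-of-N
    # (used here to stress-test harmlessness) takes min instead of max.
    # `sample` and `reward` are assumed callables supplied by the caller.
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Usage with stand-in callables: pick the longest of three candidates.
outs = iter(["a", "ccc", "bb"])
pick = best_of_n("prompt", lambda p: next(outs), len, 3)  # -> "ccc"
```

The BoN win rate then compares this selected response against a single draw from the base policy π_ref, which is why alignment objectives unaware of the inference-time selection can be suboptimal.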