InfAlign: Inference-aware language model alignment

Authors: Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we apply InfAlign-CTRL to the Anthropic helpfulness (Bai et al., 2022) and Reddit summarization (Stiennon et al., 2020) datasets for optimizing BoN performance, and to Anthropic harmlessness for optimizing WoN performance, with various values of N. We show that solving InfAlign through InfAlign-CTRL outperforms various SOTA RLHF solvers on inference-time win rate by 3-8%.
Researcher Affiliation | Industry | 1Google DeepMind, 2Google Research. Correspondence to: Ananth Balashankar <EMAIL>, Ziteng Sun <EMAIL>, Jonathan Berant <EMAIL>, Jacob Eisenstein <EMAIL>, Ananda Theertha Suresh <EMAIL>, Ahmad Beirami <EMAIL>.
Pseudocode | Yes | Algorithm 1: Implementation of InfAlign-CTRL. Require: base policy π_ref, (uncalibrated) reward model r, set of training prompts D ⊆ X, number of offline rollouts per prompt K, transformation function Φ. 1: Compute the empirical calibrated reward Ĉ_{r,π_ref} using Eq. (12) with K offline rollouts per x ∈ D. 2: Transform the calibrated reward using Φ to get R_Φ = Φ(Ĉ_{r,π_ref}). 3: Optimize the RLHF objective of Eq. (4) with the calibrated, transformed reward using PPO.
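Steps 1-2 of the algorithm above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `calibrated_reward`, `transform_bon`, and the exponent `t` are hypothetical names/values, and step 3 (PPO optimization of the transformed reward) is omitted.

```python
import math

def calibrated_reward(score, rollout_scores):
    # Empirical calibration in the spirit of Eq. (12): map a response's
    # raw reward to the fraction of K base-policy rollout scores it meets
    # or exceeds, yielding a quantile in [0, 1].
    return sum(s <= score for s in rollout_scores) / len(rollout_scores)

def transform_bon(c, t=4.0):
    # Hypothetical monotone transformation Phi(c) = exp(t * c) applied to
    # the calibrated reward; the paper's actual choice of Phi and its
    # hyperparameters may differ.
    return math.exp(t * c)

# A response scoring above all K = 4 offline rollouts calibrates to 1.0.
c = calibrated_reward(5.0, [1.0, 2.0, 3.0, 4.0])  # -> 1.0
r_phi = transform_bon(c)
```

Calibration makes the reward scale-invariant (only its rank under π_ref matters), which is what lets a single transformation Φ target BoN or WoN inference.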
Open Source Code | No | The paper does not provide an explicit statement or link to an open-source code repository for the described methodology.
Open Datasets | Yes | We consider the following tasks: (1) Anthropic Helpfulness and Harmlessness datasets (Bai et al., 2022), which involve multi-turn dialogues between a human and a digital assistant. For training the reward models, the preference datasets consist of two responses for one context and a label for the human-preferred response. We use the train splits of the two datasets (44K examples for helpfulness and 42K for harmlessness) to train separate uncalibrated and calibrated reward models for each objective. (2) Similarly, for the summarization quality task, we use Reddit posts from the TL;DR dataset (Stiennon et al., 2020) and train uncalibrated and calibrated reward models on its train split.
Dataset Splits | Yes | We use the train splits of the two datasets (44K examples for helpfulness and 42K for harmlessness) to train separate uncalibrated and calibrated reward models for each objective. (2) Similarly, for the summarization quality task, we use Reddit posts from the TL;DR dataset (Stiennon et al., 2020) and train uncalibrated and calibrated reward models on its train split. ... We report win rate on the test split as measured by the PaLM-2 M reward model trained on the corresponding datasets.
Hardware Specification | No | The paper mentions using the 'PaLM-2 S model' and 'PaLM-2 M model', which are language models, but does not specify any hardware details such as GPU/CPU models or types.
Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the optimization algorithm but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We set K = 100 in our experiments and analyze the additional computational overhead in the Appendix. ... For each run, we experiment with different KL-regularizer strengths (β ∈ {0.01, ..., 0.09}) and obtain Pareto curves of KL divergence vs. {standard, BoN, WoN} win rate.
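Since the BoN/WoN win rates above are central to the evaluation, here is a minimal sketch of best-of-N selection at inference time. The helper names (`best_of_n`, `sample`, `reward`) are illustrative, not from the paper's codebase.

```python
def best_of_n(prompt, sample, reward, n):
    # Best-of-N decoding: draw n candidate responses from the policy and
    # return the one scoring highest under the reward model. Worst-of-N
    # (used here to stress-test harmlessness) takes min instead of max.
    # `sample` and `reward` are assumed callables supplied by the caller.
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Usage with stand-in callables: pick the longest of three candidates.
outs = iter(["a", "ccc", "bb"])
pick = best_of_n("prompt", lambda p: next(outs), len, 3)  # -> "ccc"
```

The BoN win rate then compares this selected response against a single draw from the base policy π_ref, which is why alignment objectives unaware of the inference-time selection can be suboptimal.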