Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Authors: Tengyang Xie, Dylan Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed H. Awadallah, Alexander Rakhlin

ICLR 2025

Reproducibility assessment (Variable: Result. LLM Response)
Research Type: Theoretical. This paper investigates a basic question in reinforcement learning from human feedback (RLHF) from a theoretical perspective: how to explore efficiently in an online manner under preference feedback and general function approximation. We take the initial step towards a theoretical understanding of this problem by proposing a novel algorithm, Exploratory Preference Optimization (XPO). We prove that XPO is provably sample-efficient and converges to a near-optimal policy under natural exploration conditions, regardless of the initial model's coverage. Our analysis builds on the observation that DPO implicitly performs a form of Bellman error minimization.
Researcher Affiliation: Collaboration. Tengyang Xie (UW-Madison); Dylan J. Foster (Microsoft Research); Akshay Krishnamurthy (Microsoft Research); Corby Rosset (Microsoft Research); Ahmed Awadallah (Microsoft Research); Alexander Rakhlin (MIT)
Pseudocode: Yes. Algorithm 1, Exploratory Preference Optimization (XPO), takes as input a user-specified policy class Π and proceeds in almost the same fashion as Online DPO. The paper also presents variants: Algorithm 2 (XPO with a general sampling policy), Algorithm 3 (XPO with historical sampling), and Algorithm 4 (XPO with a general sampling policy and large batch size).
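For context, the Pseudocode row above notes that XPO proceeds almost like Online DPO, augmenting its objective with an exploration term. Below is a minimal sketch of the standard per-pair DPO loss, plus a toy optimism bonus alpha * log pi(ỹ | x) as a stand-in for XPO's exploration term. The exact form and sign of that bonus follow the paper's Algorithm 1, which is authoritative; the function and variable names here are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigma(beta * implicit-reward margin),
    where the margin compares policy-vs-reference log-probabilities of
    the preferred (w) and dispreferred (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def xpo_style_loss(logp_w: float, logp_l: float,
                   ref_logp_w: float, ref_logp_l: float,
                   logp_tilde: float,
                   beta: float = 0.1, alpha: float = 0.01) -> float:
    """DPO loss plus a toy optimism bonus on an exploratory sample y_tilde.
    NOTE: the real XPO bonus (sign, weighting, which samples it sums over)
    is defined in the paper's Algorithm 1; this placeholder merely rewards
    higher log pi(y_tilde | x) to illustrate the shape of the objective."""
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) - alpha * logp_tilde
```

With equal policy and reference log-probabilities the margin is zero, so the DPO term reduces to -log(1/2) = log 2; a larger margin in favor of the preferred response drives the loss toward zero.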
Open Source Code: No. The paper does not contain any explicit statements or links indicating the availability of open-source code for the methodology described.
Open Datasets: No. The paper mentions 'human-labeled preference data' and a preference dataset D_pref, as well as collecting feedback on responses sampled from the model during training. However, it does not identify a specific publicly available dataset with concrete access information (e.g., a URL, DOI, or formal citation to an established benchmark) used for empirical evaluation.
Dataset Splits: No. As this is a theoretical paper without empirical studies on specific datasets, no training, validation, or test splits are mentioned.
Hardware Specification: No. As a theoretical paper focused on algorithmic design and analysis, it describes no experimental setup and therefore specifies no hardware.
Software Dependencies: No. For the same reason, it specifies no software dependencies with version numbers.
Experiment Setup: No. The paper conducts no empirical experiments and thus provides no details about experimental setup, hyperparameters, or training configurations.