Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Authors: Tengyang Xie, Dylan Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed H. Awadallah, Alexander Rakhlin

ICLR 2025

Reproducibility assessment (Variable: Result. LLM Response)
Research Type: Theoretical. This paper investigates a basic question in reinforcement learning from human feedback (RLHF) from a theoretical perspective: how to explore efficiently in an online manner under preference feedback and general function approximation. We take the initial step towards a theoretical understanding of this problem by proposing a novel algorithm, Exploratory Preference Optimization (XPO). We prove that XPO is provably sample-efficient and converges to a near-optimal policy under natural exploration conditions, regardless of the initial model's coverage. Our analysis builds on the observation that DPO implicitly performs a form of Bellman error minimization.
Researcher Affiliation: Collaboration. Tengyang Xie (UW-Madison); Dylan J. Foster (Microsoft Research); Akshay Krishnamurthy (Microsoft Research); Corby Rosset (Microsoft Research); Ahmed Awadallah (Microsoft Research); Alexander Rakhlin (MIT)
Pseudocode: Yes. Algorithm 1, Exploratory Preference Optimization (XPO), takes as input a user-specified policy class Π and proceeds in almost the same fashion as Online DPO. The paper also presents variants: Algorithm 2 (XPO with a general sampling policy), Algorithm 3 (XPO with historical sampling), and Algorithm 4 (XPO with a general sampling policy and large batch size).
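For context, the Pseudocode row above notes that XPO proceeds almost like Online DPO, augmenting its objective with an exploration term. Below is a minimal sketch of the standard per-pair DPO loss, plus a toy optimism bonus alpha * log pi(ỹ | x) as a stand-in for XPO's exploration term. The exact form and sign of that bonus follow the paper's Algorithm 1, which is authoritative; the function and variable names here are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigma(beta * implicit-reward margin),
    where the margin compares policy-vs-reference log-probabilities of
    the preferred (w) and dispreferred (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def xpo_style_loss(logp_w: float, logp_l: float,
                   ref_logp_w: float, ref_logp_l: float,
                   logp_tilde: float,
                   beta: float = 0.1, alpha: float = 0.01) -> float:
    """DPO loss plus a toy optimism bonus on an exploratory sample y_tilde.
    NOTE: the real XPO bonus (sign, weighting, which samples it sums over)
    is defined in the paper's Algorithm 1; this placeholder merely rewards
    higher log pi(y_tilde | x) to illustrate the shape of the objective."""
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) - alpha * logp_tilde
```

With equal policy and reference log-probabilities the margin is zero, so the DPO term reduces to -log(1/2) = log 2; a larger margin in favor of the preferred response drives the loss toward zero.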
Open Source Code: No. The paper does not contain any explicit statements or links indicating the availability of open-source code for the methodology described.
Open Datasets: No. The paper mentions 'human-labeled preference data' and a preference dataset D_pref, as well as collecting feedback on responses sampled from the model during training. However, it does not identify a specific publicly available dataset with concrete access information (e.g., a URL, DOI, or formal citation to an established benchmark) used for empirical evaluation.
Dataset Splits: No. As this is a theoretical paper without empirical studies on specific datasets, no training, validation, or test splits are mentioned.
Hardware Specification: No. As a theoretical paper focused on algorithmic design and analysis, it describes no experimental setup and therefore specifies no hardware.
Software Dependencies: No. For the same reason, it specifies no software dependencies with version numbers.
Experiment Setup: No. The paper conducts no empirical experiments and thus provides no details about experimental setup, hyperparameters, or training configurations.