Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Authors: Tengyang Xie, Dylan Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed H Awadallah, Alexander Rakhlin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper investigates a basic question in reinforcement learning from human feedback (RLHF) from a theoretical perspective: how to efficiently explore in an online manner under preference feedback and general function approximation. We take the initial step towards a theoretical understanding of this problem by proposing a novel algorithm, Exploratory Preference Optimization (XPO). We prove that XPO is provably sample-efficient and converges to a near-optimal policy under natural exploration conditions, regardless of the initial model's coverage. Our analysis builds on the observation that DPO implicitly performs a form of Bellman error minimization. |
| Researcher Affiliation | Collaboration | Tengyang Xie UW-Madison EMAIL; Dylan J. Foster Microsoft Research EMAIL; Akshay Krishnamurthy Microsoft Research EMAIL; Corby Rosset Microsoft Research EMAIL; Ahmed Awadallah Microsoft Research EMAIL; Alexander Rakhlin MIT EMAIL |
| Pseudocode | Yes | The paper presents Algorithm 1, Exploratory Preference Optimization (XPO). The algorithm takes as input a user-specified policy class Π and proceeds in almost the same fashion as Online DPO. [...] Algorithm 2: Exploratory Preference Optimization (XPO) with general sampling policy. [...] Algorithm 3: Exploratory Preference Optimization (XPO) with historical sampling. [...] Algorithm 4: Exploratory Preference Optimization (XPO) with general sampling policy and large batch size. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of open-source code for the methodology described. |
| Open Datasets | No | The paper mentions 'human-labeled preference data' and refers to a 'dataset D_pref', as well as the process of 'collecting feedback from responses sampled from the model during training'. However, it does not specify a particular publicly available dataset with concrete access information (e.g., a URL, DOI, or formal citation to an established benchmark dataset) used for empirical evaluation. |
| Dataset Splits | No | As this is a theoretical paper that does not conduct empirical studies with specific datasets, there is no mention of dataset splits (e.g., training, validation, test splits). |
| Hardware Specification | No | As this is a theoretical paper focused on algorithmic design and analysis, it does not describe any experimental setup, and therefore does not specify the hardware used. |
| Software Dependencies | No | As this is a theoretical paper focused on algorithmic design and analysis, it does not describe any experimental setup, and therefore does not specify software dependencies with version numbers. |
| Experiment Setup | No | As this is a theoretical paper focused on algorithmic design and analysis, it does not conduct empirical experiments and thus does not provide details about experimental setup, hyperparameters, or training configurations. |
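To make the pseudocode entry above concrete: XPO, as the table describes it, proceeds like Online DPO but augments the DPO preference loss with an optimism term that rewards putting probability mass on exploratory responses. The sketch below is a hedged illustration of that per-pair objective on toy scalar log-probabilities; the function name `xpo_objective` and the exact weighting are our assumptions for illustration, not the paper's implementation.

```python
import math

def xpo_objective(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej,
                  logp_explore, beta=0.1, alpha=0.01):
    """Toy per-pair XPO-style objective (illustrative sketch only).

    DPO term: -log sigmoid(beta * (log-prob margin under pi, relative
    to the reference policy pi_ref)).
    Optimism term (assumed form): -alpha * log-prob of an exploratory
    response, nudging the policy to keep mass on responses it has not
    yet received preference feedback for. With alpha = 0 this reduces
    to the standard DPO loss on one preference pair.
    """
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    dpo_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    optimism = -alpha * logp_explore
    return dpo_loss + optimism

# With alpha = 0, this is an ordinary DPO loss on one pair.
loss = xpo_objective(-1.0, -2.0, -1.5, -1.5, -3.0, beta=1.0, alpha=0.0)
```

In the full algorithm this objective would be summed over all pairs collected so far and minimized over the policy class Π at each round; the scalar version here only shows how the exploration bonus enters the loss.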