A Theoretical Framework for Partially-Observed Reward States in RLHF

Authors: Chinmaya Kausik, Mirco Mutti, Aldo Pacchiano, Ambuj Tewari

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | The growing deployment of reinforcement learning from human feedback (RLHF) calls for a deeper theoretical investigation of its underlying models. The prevalent models of RLHF do not account for neuroscience-backed, partially-observed internal states that can affect human feedback, nor do they accommodate intermediate feedback during an interaction. Both of these can be instrumental in speeding up learning and improving alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We accommodate two kinds of feedback: cardinal and dueling feedback. We first demonstrate that PORRL subsumes a wide class of RL problems, including traditional RL, RLHF, and reward machines. For cardinal feedback, we present two model-based methods (POR-UCRL, POR-UCBVI). We give both cardinal regret and sample complexity guarantees for the methods, showing that they improve over naive history-summarization. We then discuss the benefits of a model-free method like GOLF with naive history-summarization in settings with recursive internal states and dense intermediate feedback. For this purpose, we define a new history-aware version of the Bellman-eluder dimension and give a new guarantee for GOLF in our setting, which can be exponentially sharper in illustrative examples. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret to dueling regret. In both feedback settings, we show that our models and guarantees generalize and extend existing ones. ... While the aim of our work is largely theoretical, we extract practical insights from our results throughout the text.
Researcher Affiliation | Academia | Chinmaya Kausik, University of Michigan (EMAIL); Mirco Mutti, Technion (EMAIL); Aldo Pacchiano, Boston University and Broad Institute of MIT & Harvard (EMAIL); Ambuj Tewari, University of Michigan (EMAIL)
Pseudocode | Yes | Algorithm 1 (Reduction from Dueling to Cardinal Confidence-Set Optimism): 1: Input: known reward functions {r_h}_{h=1}^H, method to compute C_M(D, δ) = C_P(D, δ) × C_F(D, δ); 2: Initialize dataset D_1 ← {}, C_M(D_1, δ) := P × F; 3: for t = 1, ..., T do ... Algorithm 2: Generic Confidence-Set Optimism ... Algorithm 3: Generic Bonus-Based Optimism ... Algorithm 4: Generic Model-Free Optimism ... Algorithm 5: POR-UCRL ... Algorithm 6: POR-UCBVI ... Algorithm 7: GOLF ... Algorithm 8: Reduction from Dueling to Cardinal Bonus-Based Optimism
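The excerpted Algorithm 1 follows a standard confidence-set optimism loop: maintain a dataset and a confidence set over models, pick policies optimistically, collect dueling feedback, and refine the set. A minimal Python sketch of that loop structure follows; it is an illustrative reading of the pseudocode, not the paper's implementation, and every function name (`init_confidence_set`, `select_optimistic_pair`, `duel`, `update_confidence_set`) is a hypothetical placeholder supplied by the caller.

```python
def dueling_to_cardinal(T, init_confidence_set, select_optimistic_pair,
                        duel, update_confidence_set):
    """Sketch of a confidence-set optimism loop with dueling feedback.

    All arguments are caller-supplied callables (hypothetical names):
      init_confidence_set()        -> initial confidence set, C_M(D_1, delta)
      select_optimistic_pair(cs)   -> optimistic policy pair drawn from cs
      duel(pi_1, pi_2)             -> observed dueling feedback for the pair
      update_confidence_set(data)  -> refined confidence set C_M(D_t, delta)
    """
    dataset = []                          # D_1 <- {}
    conf_set = init_confidence_set()      # C_M(D_1, delta) := P x F
    for _ in range(T):                    # for t = 1, ..., T do
        # Choose an optimistic pair of policies from the current set.
        pi_1, pi_2 = select_optimistic_pair(conf_set)
        # Duel the two policies and record the observed feedback.
        outcome = duel(pi_1, pi_2)
        dataset.append((pi_1, pi_2, outcome))
        # Refine the confidence set using the enlarged dataset.
        conf_set = update_confidence_set(dataset)
    return dataset
```

The sketch only fixes the control flow; the substance of the reduction (how the cardinal confidence sets C_P and C_F are built and how optimism is computed over pairs) lives in the caller-supplied functions.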
Open Source Code | No | The paper does not provide any explicit statements about open-sourcing code, nor does it include links to a code repository or mention code in supplementary materials.
Open Datasets | No | The paper is theoretical and does not use any real-world datasets for empirical evaluation. It describes a 'combination lock' as a 'toy model' and 'illustrative hard example' in Section 2.1, but this is for conceptual demonstration rather than experimental analysis.
Dataset Splits | No | The paper does not conduct experiments on datasets; therefore, no dataset splits are provided.
Hardware Specification | No | The paper does not describe any experiments that would require specific hardware, and thus no hardware specifications are mentioned.
Software Dependencies | No | The paper describes theoretical frameworks and algorithms but does not provide details on specific software dependencies or version numbers for implementation.
Experiment Setup | No | The paper presents theoretical contributions and does not include an experimental setup with hyperparameters or training configurations.