Preference Elicitation for Offline Reinforcement Learning
Authors: Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we develop a practical implementation of our algorithm and demonstrate its empirical efficiency and scalability across various decision-making environments. In this section, we demonstrate the effectiveness of our preference elicitation strategy, Sim-OPRL, across a range of offline reinforcement learning environments and datasets. Performance and sample complexity results with different preference elicitation methods are given in Figure 1 and Table 2. We conduct ablations for our algorithm on a simple tabular MDP, with results in Figure 2. We compare different preference elicitation strategies on a range of environments detailed in Appendix D. Among others, we explore environments from the D4RL benchmark (Fu et al., 2020) identified as particularly challenging offline preference-based reinforcement learning tasks (Shin et al., 2022), as well as a medical simulation designed to model the evolution of patients with sepsis (Oberst and Sontag, 2019)." |
| Researcher Affiliation | Academia | Alizée Pace ETH AI Center, ETH Zürich MPI for Intelligent Systems, Tübingen EMAIL Bernhard Schölkopf MPI for Intelligent Systems & ELLIS Institute Tübingen Gunnar Rätsch ETH Zürich Giorgia Ramponi University of Zürich |
| Pseudocode | Yes | Algorithm 1 Offline Preference-based Reinforcement Learning with Preference Elicitation Algorithm 2 Preference Elicitation through Simulated Trajectory Sampling. Algorithm 3 Sim-OPRL: Practical Algorithm |
| Open Source Code | No | The paper links publicly available code for the third-party Sepsis simulator used in its experiments (https://github.com/clinicalml/gumbel-max-scm/tree/sim-v2/sepsisSimDiabetes). However, there is no explicit statement or link for the authors' own implementation of Sim-OPRL or the other methods described in the paper. |
| Open Datasets | Yes | "We compare different preference elicitation strategies on a range of environments detailed in Appendix D. Among others, we explore environments from the D4RL benchmark (Fu et al., 2020) identified as particularly challenging offline preference-based reinforcement learning tasks (Shin et al., 2022), as well as a medical simulation designed to model the evolution of patients with sepsis (Oberst and Sontag, 2019)." The halfcheetah-random-v2 dataset is also part of the D4RL benchmark (Fu et al., 2020). The sepsis simulator (Oberst and Sontag, 2019) is a commonly used environment for medically-motivated RL work (Tang and Wiens, 2021); the authors use the original authors' publicly available code (https://github.com/clinicalml/gumbel-max-scm/tree/sim-v2/sepsisSimDiabetes, MIT license). |
| Dataset Splits | No | The paper mentions the size of some offline datasets: "offline dataset Doffline consists of 40 trajectories" (Star MDP), "The offline dataset contains 150 episodes" (Gridworld), "The dataset consists of 1 million transitions" (HalfCheetah-Random), and "The offline trajectories dataset includes 10,000 episodes" (Sepsis Simulation). While it refers to established benchmarks like D4RL, which often have predefined splits, the paper does not explicitly state how these datasets were split into training, validation, or test sets for its experiments. |
| Hardware Specification | Yes | We trained all models on two 64-core AMD processors or a single NVIDIA RTX2080Ti GPU. |
| Software Dependencies | No | The paper mentions several software components like "Adam optimizer (Kingma and Ba, 2014)", "cvxopt (Diamond and Boyd, 2016)", "Proximal Policy Optimization (Schulman et al., 2017) implemented in stable-baselines3 (Raffin et al., 2021)", and "Soft Actor-Critic (Haarnoja et al., 2018)". However, specific version numbers for these software packages or libraries are not provided. |
| Experiment Setup | Yes | For all baselines, transition and reward models were implemented as linear classifiers (Star MDP), as two-layer perceptrons with ReLU activation and hidden layer dimension 32 (Gridworld, Sepsis, MiniGrid environments), or as 5-layer MLPs with ReLU activations and hidden sizes [512, 256, 128, 64, 32] (HalfCheetah environments). Training was carried out for two epochs (transition model) and one epoch (reward model), with the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 10⁻³. Ensembles of size \|T\| = \|R\| = 5 are used for both transition and reward models. Hyperparameters λ_T, λ_R control the degree of conservatism; the experiments set λ_T = 0.5, λ_R = 0.1 (Star MDP, Gridworld) and λ_T = λ_R = 1 (Sepsis). For computational efficiency, preferences are sampled in batches of 4 (Star MDP, Gridworld) or 100 (Sepsis) to reduce the number of model updates needed. Policy optimization is carried out exactly through linear programming for the Star MDP and Gridworld using cvxopt (Diamond and Boyd, 2016), based on code from Lindner et al. (2021); via Proximal Policy Optimization (Schulman et al., 2017) implemented in stable-baselines3 (Raffin et al., 2021) for the Sepsis and MiniGrid environments; and via Soft Actor-Critic (Haarnoja et al., 2018) for HalfCheetah. In the latter case, after every preference collection episode, reward and policy models were trained from the checkpoint of the previous iteration, for only 20 steps to minimize computation. |
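To make the quoted setup concrete, the sketch below mirrors the reward-model configuration reported above (a two-layer perceptron with ReLU and hidden size 32, in an ensemble of \|R\| = 5) in plain numpy. The paper's own implementation is not released, so everything here is illustrative: the `pessimistic_reward` helper, which penalizes the ensemble mean by λ_R times the ensemble standard deviation, is one common way to realize the conservatism that λ_R controls, not necessarily the authors' exact mechanism.

```python
import numpy as np


class TwoLayerMLP:
    """Two-layer perceptron with ReLU and hidden size 32,
    matching the reward-model architecture quoted in the table."""

    def __init__(self, in_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, x):
        h = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU activation
        return (h @ self.W2 + self.b2).squeeze(-1)  # scalar reward per input


def pessimistic_reward(models, x, lam_R=0.1):
    """Conservative reward estimate: ensemble mean minus lam_R times the
    ensemble standard deviation (an assumed form of the paper's pessimism)."""
    preds = np.stack([m(x) for m in models])  # shape (|R|, batch)
    return preds.mean(axis=0) - lam_R * preds.std(axis=0)


# Ensemble of |R| = 5 reward models, as in the reported setup.
ensemble = [TwoLayerMLP(in_dim=4, seed=s) for s in range(5)]
states = np.ones((3, 4))  # toy batch of 3 state features
r_pess = pessimistic_reward(ensemble, states, lam_R=0.1)
```

In practice each ensemble member would be trained on preference labels with Adam at learning rate 10⁻³ as reported; the mean-minus-std penalty simply makes the policy optimizer distrust state regions where the reward models disagree.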