Preference Elicitation for Offline Reinforcement Learning
Authors: Alizée Pace, Bernhard Schölkopf, Gunnar Rätsch, Giorgia Ramponi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, we develop a practical implementation of our algorithm and demonstrate its empirical efficiency and scalability across various decision-making environments. In this section, we demonstrate the effectiveness of our preference elicitation strategy, Sim-OPRL, across a range of offline reinforcement learning environments and datasets. Performance and sample complexity results with different preference elicitation methods are given in Figure 1 and Table 2. We conduct ablations for our algorithm on a simple tabular MDP, with results in Figure 2. We compare different preference elicitation strategies on a range of environments detailed in Appendix D. Among others, we explore environments from the D4RL benchmark (Fu et al., 2020) identified as particularly challenging offline preference-based reinforcement learning tasks (Shin et al., 2022), as well as a medical simulation designed to model the evolution of patients with sepsis (Oberst and Sontag, 2019)." |
| Researcher Affiliation | Academia | Alizée Pace ETH AI Center, ETH Zürich MPI for Intelligent Systems, Tübingen EMAIL Bernhard Schölkopf MPI for Intelligent Systems & ELLIS Institute Tübingen Gunnar Rätsch ETH Zürich Giorgia Ramponi University of Zürich |
| Pseudocode | Yes | Algorithm 1 Offline Preference-based Reinforcement Learning with Preference Elicitation Algorithm 2 Preference Elicitation through Simulated Trajectory Sampling. Algorithm 3 Sim-OPRL: Practical Algorithm |
| Open Source Code | No | The paper links publicly available code for the third-party Sepsis simulator used in its experiments (https://github.com/clinicalml/gumbel-max-scm/tree/sim-v2/sepsisSimDiabetes). However, there is no explicit statement or link for the authors' own implementation of Sim-OPRL or the other methods described in the paper. |
| Open Datasets | Yes | "We compare different preference elicitation strategies on a range of environments detailed in Appendix D. Among others, we explore environments from the D4RL benchmark (Fu et al., 2020) identified as particularly challenging offline preference-based reinforcement learning tasks (Shin et al., 2022), as well as a medical simulation designed to model the evolution of patients with sepsis (Oberst and Sontag, 2019)." The halfcheetah-random-v2 dataset is also part of the D4RL benchmark (Fu et al., 2020). The sepsis simulator (Oberst and Sontag, 2019) is a commonly used environment for medically-motivated RL work (Tang and Wiens, 2021); the authors use the original authors' publicly available code (https://github.com/clinicalml/gumbel-max-scm/tree/sim-v2/sepsisSimDiabetes, MIT license). |
| Dataset Splits | No | The paper mentions the size of some offline datasets: "offline dataset Doffline consists of 40 trajectories" (Star MDP), "The offline dataset contains 150 episodes" (Gridworld), "The dataset consists of 1 million transitions" (HalfCheetah-Random), and "The offline trajectories dataset includes 10,000 episodes" (Sepsis Simulation). While it refers to established benchmarks like D4RL, which often have predefined splits, the paper does not explicitly state how these datasets were split into training, validation, or test sets for its experiments. |
| Hardware Specification | Yes | We trained all models on two 64-core AMD processors or a single NVIDIA RTX2080Ti GPU. |
| Software Dependencies | No | The paper mentions several software components like "Adam optimizer (Kingma and Ba, 2014)", "cvxopt (Diamond and Boyd, 2016)", "Proximal Policy Optimization (Schulman et al., 2017) implemented in stable-baselines3 (Raffin et al., 2021)", and "Soft Actor-Critic (Haarnoja et al., 2018)". However, specific version numbers for these software packages or libraries are not provided. |
| Experiment Setup | Yes | For all baselines, transition and reward models were implemented as linear classifiers (Star MDP), as two-layer perceptrons with ReLU activation and hidden layer dimension 32 (Gridworld, Sepsis, MiniGrid environments), or as 5-layer MLPs with ReLU activations and hidden sizes [512, 256, 128, 64, 32] (HalfCheetah environments). Training was carried out for two epochs (transition model) and one epoch (reward model), with the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 10⁻³. Ensembles of size \|T\| = \|R\| = 5 are used for both transition and reward models. Hyperparameters λ_T, λ_R control the degree of conservatism; the experiments set λ_T = 0.5, λ_R = 0.1 (Star MDP, Gridworld) and λ_T = λ_R = 1 (Sepsis). For computational efficiency, preferences are sampled in batches of 4 (Star MDP, Gridworld) or 100 (Sepsis) to reduce the number of model updates needed. Policy optimization is carried out exactly through linear programming for the Star MDP and Gridworld using cvxopt (Diamond and Boyd, 2016), based on code from Lindner et al. (2021); via Proximal Policy Optimization (Schulman et al., 2017) implemented in stable-baselines3 (Raffin et al., 2021) for the Sepsis and MiniGrid environments; and via Soft Actor-Critic (Haarnoja et al., 2018) for HalfCheetah. In the latter case, after every preference collection episode, reward and policy models were trained from the checkpoint of the previous iteration, for only 20 steps to minimize computation. |
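To make the quoted setup concrete, the sketch below mirrors the reward-model configuration reported above (a two-layer perceptron with ReLU and hidden size 32, in an ensemble of \|R\| = 5) in plain numpy. The paper's own implementation is not released, so everything here is illustrative: the `pessimistic_reward` helper, which penalizes the ensemble mean by λ_R times the ensemble standard deviation, is one common way to realize the conservatism that λ_R controls, not necessarily the authors' exact mechanism.

```python
import numpy as np


class TwoLayerMLP:
    """Two-layer perceptron with ReLU and hidden size 32,
    matching the reward-model architecture quoted in the table."""

    def __init__(self, in_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, x):
        h = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU activation
        return (h @ self.W2 + self.b2).squeeze(-1)  # scalar reward per input


def pessimistic_reward(models, x, lam_R=0.1):
    """Conservative reward estimate: ensemble mean minus lam_R times the
    ensemble standard deviation (an assumed form of the paper's pessimism)."""
    preds = np.stack([m(x) for m in models])  # shape (|R|, batch)
    return preds.mean(axis=0) - lam_R * preds.std(axis=0)


# Ensemble of |R| = 5 reward models, as in the reported setup.
ensemble = [TwoLayerMLP(in_dim=4, seed=s) for s in range(5)]
states = np.ones((3, 4))  # toy batch of 3 state features
r_pess = pessimistic_reward(ensemble, states, lam_R=0.1)
```

In practice each ensemble member would be trained on preference labels with Adam at learning rate 10⁻³ as reported; the mean-minus-std penalty simply makes the policy optimizer distrust state regions where the reward models disagree.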