Conciliator steering: Imposing user preference in multi-objective reinforcement learning
Authors: Sara Pyykölä, Klavdiya Olegovna Bochenina, Laura Ruotsalainen
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test Conciliator steering on Deep Sea Treasure v1 benchmark problem and demonstrate that it can find user-preferred policies with effortless and simple user-agent interaction and negligible bias, which has not been possible before. Additionally, we show that on average Conciliator steering results in a fraction of carbon dioxide emissions and total energy consumption when compared to a training of fully connected MNIST classifier, both run on a personal laptop. |
| Researcher Affiliation | Academia | Sara Pyykölä EMAIL Department of Computer Science University of Helsinki, Finland Klavdiya Bochenina EMAIL Department of Computer Science University of Helsinki, Finland Laura Ruotsalainen EMAIL Department of Computer Science University of Helsinki, Finland |
| Pseudocode | Yes | Summarizing, a diagram of the proposed solution is presented in Figure 2, while the pseudo-code for the proposed solution is presented as Algorithm 1. |
| Open Source Code | Yes | The code for the experiment and the algorithm's implementation is available in GitHub. A modified version of the Deep Sea Treasure library is also included, as our implementation introduces several bug fixes that are not available via PyPI Deep Sea Treasure v1 at the time of writing. |
| Open Datasets | Yes | We test Conciliator steering on Deep Sea Treasure v1 benchmark problem and demonstrate that it can find user-preferred policies with effortless and simple user-agent interaction and negligible bias, which has not been possible before. ... We perform an experimental study of Conciliator steering in the Deep Sea Treasure v1 benchmark by Cassimon et al. (2022), and note that Conciliator steering produces satisfactory policies, while being simple for the user to interact with. |
| Dataset Splits | No | The paper uses the Deep Sea Treasure v1 environment, which is a configurable decision-making problem/simulation. It describes environment parameters, but does not refer to a pre-collected dataset with explicit training, validation, or test splits. The environment itself generates data through interaction. |
| Hardware Specification | Yes | Table 2: The specifications of the equipment used in the experiment as well as the resulting power and energy consumption and carbon emissions on average over the experiment, reported up to three decimals accuracy. ... CPU model Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz ... GPU model GeForce GTX 1650 with Max-Q Design ... RAM size in GB 16 |
| Software Dependencies | Yes | Table 2: The specifications of the equipment used in the experiment as well as the resulting power and energy consumption and carbon emissions on average over the experiment, reported up to three decimals accuracy. ... OS Windows 10-10.0, Python version 3.9.2, CodeCarbon version 2.3.2 |
| Experiment Setup | Yes | In the recommended standard setting by Cassimon et al. n = 5, but in our tests, we chose to use n = 4. ... The time limit for the maximum duration of the policy was 50 time steps. As the baseline reward r, we used the vector (57.80, 5.32, 7.20), determined as the average of the resulting episodic rewards from the Pareto optimal policies recorded in the dataset. Three different user profiles were used: first, the priority weighting of p = (1/10, 2/10, 7/10) and the preferred reward of r = (25.81, 4.56, 0.27); second, the priority weighting of p = (98/100, 1/100, 1/100) and the preferred reward of r = (115.6, 28.93, 28.93); and third, the priority weighting of p = (1/5, 2/5, 2/5) and the preferred reward of r = (24.45, 2.05, 2.05). ... Here we used the following function for the transformation: f(r_i) = r_i/60 if r_i > 0, and e^{r_i/10} if r_i ≤ 0 (Eq. 3). ... To keep this estimate exact, we chose γ = 1 and set the length of the Approximator's policies to one action for our experiments. |
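The per-component reward transformation quoted in the Experiment Setup row can be sketched as a small Python helper. This is a minimal illustration, not the authors' code: the function name `transform_reward` is our own, and the placement of the exponent (e^{r_i/10} rather than e^{r_i}/10) is an assumption made when reconstructing the garbled equation from the extracted text.

```python
import math

def transform_reward(r):
    """Apply the paper's per-component transformation (Eq. 3, as
    reconstructed here): r_i / 60 for positive components, and
    exp(r_i / 10) for non-positive ones. The exact exponent
    placement is an assumption."""
    return [ri / 60 if ri > 0 else math.exp(ri / 10) for ri in r]

# Example with the baseline reward vector from the experiment:
# all components are positive, so each is simply divided by 60.
baseline = transform_reward([57.80, 5.32, 7.20])
```

A usage note: with the baseline vector (57.80, 5.32, 7.20) every component is positive, so only the linear branch fires; the exponential branch would apply to penalty-like (non-positive) reward components.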