Conciliator steering: Imposing user preference in multi-objective reinforcement learning

Authors: Sara Pyykölä, Klavdiya Olegovna Bochenina, Laura Ruotsalainen

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We test Conciliator steering on Deep Sea Treasure v1 benchmark problem and demonstrate that it can find user-preferred policies with effortless and simple user-agent interaction and negligible bias, which has not been possible before. Additionally, we show that on average Conciliator steering results in a fraction of carbon dioxide emissions and total energy consumption when compared to a training of fully connected MNIST classifier, both run on a personal laptop.
Researcher Affiliation Academia Sara Pyykölä (EMAIL), Department of Computer Science, University of Helsinki, Finland; Klavdiya Bochenina (EMAIL), Department of Computer Science, University of Helsinki, Finland; Laura Ruotsalainen (EMAIL), Department of Computer Science, University of Helsinki, Finland
Pseudocode Yes Summarizing, a diagram of the proposed solution is presented in Figure 2, while the pseudo-code for the proposed solution is presented as Algorithm 1.
Open Source Code Yes The code for the experiment and the algorithm's implementation is available on GitHub. A modified version of the Deep Sea Treasure library is also included, as our implementation introduces several bug fixes that are not available via the PyPI Deep Sea Treasure v1 package at the time of writing.
Open Datasets Yes We test Conciliator steering on Deep Sea Treasure v1 benchmark problem and demonstrate that it can find user-preferred policies with effortless and simple user-agent interaction and negligible bias, which has not been possible before. ... We perform an experimental study of Conciliator steering in the Deep Sea Treasure v1 benchmark by Cassimon et al. (2022), and note that Conciliator steering produces satisfactory policies, while being simple for the user to interact with.
Dataset Splits No The paper uses the Deep Sea Treasure v1 environment, which is a configurable decision-making problem/simulation. It describes environment parameters, but does not refer to a pre-collected dataset with explicit training, validation, or test splits. The environment itself generates data through interaction.
Hardware Specification Yes Table 2: The specifications of the equipment used in the experiment as well as the resulting power and energy consumption and carbon emissions on average over the experiment, reported up to three decimals accuracy. ... CPU model Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz ... GPU model GeForce GTX 1650 with Max-Q Design ... RAM size in GB 16
Software Dependencies Yes Table 2: The specifications of the equipment used in the experiment as well as the resulting power and energy consumption and carbon emissions on average over the experiment, reported up to three decimals accuracy. ... OS Windows 10-10.0 Python version 3.9.2 CodeCarbon version 2.3.2
Experiment Setup Yes In the recommended standard setting by Cassimon et al. n = 5, but in our tests, we chose to use n = 4. ... The time limit for the maximum duration of the policy was 50 time steps. As the baseline reward r, we used the vector (57.80, 5.32, 7.20), determined as the average of the resulting episodic rewards from the Pareto optimal policies recorded in the dataset. Three different user profiles were used: first, the priority weighting of p = (1/10, 2/10, 7/10) and the preferred reward of r = (25.81, 4.56, 0.27); second, the priority weighting of p = (98/100, 1/100, 1/100) and the preferred reward of r = (115.6, 28.93, 28.93); and third, the priority weighting of p = (1/5, 2/5, 2/5) and the preferred reward of r = (24.45, 2.05, 2.05). ... Here we used the following piecewise function for the transformation: T(r_i) = r_i/60 if r_i > 0, and T(r_i) = e^{r_i}/10 if r_i <= 0. (3) ... To keep this estimate exact, we chose γ = 1 and set the length of the Approximator's policies to one action for our experiments.
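As a minimal sketch, the quoted reward transformation (Eq. 3) can be written in Python; the function name `transform_reward` and the component-wise application to the baseline vector are our own illustration, not identifiers from the paper's code.

```python
import math

def transform_reward(r_i: float) -> float:
    """Piecewise transformation from Eq. (3):
    positive reward components are scaled by 1/60,
    non-positive components are mapped through a scaled exponential."""
    if r_i > 0:
        return r_i / 60
    return math.exp(r_i) / 10

# Applied component-wise to the baseline reward vector quoted above
baseline = (57.80, 5.32, 7.20)
transformed = tuple(transform_reward(r) for r in baseline)
```

Note that the transformation keeps both branches positive and bounded near the origin, so rewards of either sign remain comparable after scaling.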