Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies

Authors: Kristof Van Moffaert, Ann Nowé

JMLR 2014 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We experimentally validate the algorithm on multiple environments with two and three objectives and we demonstrate that Pareto Q-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. (...) In Section 4, we conduct an empirical comparison of our algorithm to other state-of-the-art MORL algorithms." |
| Researcher Affiliation | Academia | "Kristof Van Moffaert (EMAIL), Ann Nowé (EMAIL), Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, Brussels, Belgium" |
| Pseudocode | Yes | "Algorithm 1: Single-objective Q-learning algorithm; Algorithm 2: Scalarized ϵ-greedy strategy, scal-ϵ-greedy(); Algorithm 3: Scalarized multi-objective Q-learning algorithm; Algorithm 4: Pareto Q-learning algorithm; Algorithm 5: Hypervolume Q-set evaluation; Algorithm 6: Cardinality Q-set evaluation; Algorithm 7: Track policy π given the expected reward vector Vπ(s) from state s" |
| Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no links to code repositories. |
| Open Datasets | Yes | "The Pressurized Bountiful Sea Treasure (PBST) environment, which is inspired by the Deep Sea Treasure (DST) environment (Vamplew et al., 2010). The Deep Sea Treasure (DST) is proposed by Vamplew et al. (2010) and is a standard MORL benchmark instance." |
| Dataset Splits | Yes | "As we are learning multiple policies simultaneously, which potentially may involve different parts of the state space, we found it beneficial to employ a train and test setting, where in the train mode, we learn with an ϵ-greedy action selection strategy with decreasing epsilon. In the test mode of the algorithm, we perform multiple greedy policies using Algorithm 7 for every element in ND(∪a Q̂set(s0, a)) of the start state s0 and we average the accumulated returns along the paths." |
| Hardware Specification | No | The paper does not report the hardware used for its experiments (e.g., CPU/GPU models or memory amounts). |
| Software Dependencies | No | The paper does not name the ancillary software (e.g., libraries or solvers with version numbers) needed to replicate the experiments. |
| Experiment Setup | Yes | "All the experiments are averaged over 50 runs and their 95% confidence interval is depicted at regular intervals. (...) At episode eps, we assigned ϵ to be 0.997^eps to allow for significant amounts of exploration in early runs while maximizing exploitation in later runs of the experiment. (...) Assume that the discount factor γ is set to 1 for simplicity reasons." |
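The exploration schedule quoted under Experiment Setup (ϵ = 0.997^eps, decayed per episode) can be sketched as follows. This is an illustrative sketch of the reported formula, not the authors' code; the function name is ours:

```python
def epsilon_at(episode: int, base: float = 0.997) -> float:
    """Exploration probability at a given episode under the decay
    schedule reported in the paper: epsilon = base ** episode.
    Early episodes are mostly exploratory; later ones mostly greedy."""
    return base ** episode

print(epsilon_at(0))  # 1.0 at the first episode
```

After roughly 1000 episodes, epsilon falls below 0.05, which matches the stated intent of maximizing exploitation late in the experiment.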
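The ND(·) operator quoted under Dataset Splits filters a set of expected-reward vectors down to its Pareto nondominated subset. A minimal sketch for maximization (the function name is ours, not the paper's):

```python
def pareto_nondominated(vectors):
    """Keep the vectors not Pareto-dominated by any other vector.
    Maximization: v dominates w if v >= w in every objective
    and v > w in at least one objective."""
    nd = []
    for v in vectors:
        dominated = any(
            all(o >= x for o, x in zip(other, v))
            and any(o > x for o, x in zip(other, v))
            for other in vectors
        )
        if not dominated:
            nd.append(v)
    return nd
```

For example, `pareto_nondominated([(1, 5), (2, 3), (0, 1)])` keeps `(1, 5)` and `(2, 3)`; `(0, 1)` is dominated by both.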
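The hypervolume metric the paper uses to compare MORL algorithms measures, for two objectives, the area of objective space dominated by a policy set relative to a reference point. A minimal two-objective sketch, assuming maximization and a reference point weakly dominated by every point (names are ours, not the paper's Algorithm 5):

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref`
    (two objectives, maximization)."""
    hv, prev_y = 0.0, ref[1]
    # Sweep from largest to smallest first objective, stacking rectangles.
    for x, y in sorted(set(points), key=lambda p: p[0], reverse=True):
        if y > prev_y:  # dominated points contribute no new area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv
```

For the front `[(3, 1), (2, 2), (1, 3)]` with reference `(0, 0)` this yields 6.0, the area of the three stacked staircase rectangles; adding a dominated point such as `(1, 1)` leaves the result unchanged.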