Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies
Authors: Kristof Van Moffaert, Ann Nowé
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate the algorithm on multiple environments with two and three objectives and we demonstrate that Pareto Q-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. (...) In Section 4, we conduct an empirical comparison of our algorithm to other state-of-the-art MORL algorithms. |
| Researcher Affiliation | Academia | Kristof Van Moffaert EMAIL Ann Nowe EMAIL Department of Computer Science Vrije Universiteit Brussel Pleinlaan 2, Brussels, Belgium |
| Pseudocode | Yes | Algorithm 1 Single-objective Q-learning algorithm Algorithm 2 Scalarized ϵ-greedy strategy, scal-ϵ-greedy() Algorithm 3 Scalarized multi-objective Q-learning algorithm Algorithm 4 Pareto Q-learning algorithm Algorithm 5 Hypervolume Qset evaluation Algorithm 6 Cardinality Qset evaluation Algorithm 7 Track policy π given the expected reward vector Vπ(s) from state s |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide any links to code repositories. |
| Open Datasets | Yes | The Pressurized Bountiful Sea Treasure (PBST) environment, which is inspired by the Deep Sea Treasure (DST) environment (Vamplew et al., 2010). The Deep Sea Treasure (DST) is proposed by Vamplew et al. (2010) and is a standard MORL benchmark instance. |
| Dataset Splits | Yes | As we are learning multiple policies simultaneously, which potentially may involve different parts of the state space, we found it beneficial to employ a train and test setting, where in the train mode, we learn with an ϵ-greedy action selection strategy with decreasing epsilon. In the test mode of the algorithm, we perform multiple greedy policies using Algorithm 7 for every element in ND( a ˆQset(s0, a)) of the start state s0 and we average the accumulated returns along the paths. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It does not mention any specific hardware setup. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | All the experiments are averaged over 50 runs and their 95% confidence interval is depicted at regular intervals. (...) At episode eps, we assigned ϵ to be 0.997^eps to allow for significant amounts of exploration in early runs while maximizing exploitation in later runs of the experiment. (...) Assume that the discount factor γ is set to 1 for simplicity reasons. |
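The paper's headline comparison metric is the hypervolume of the obtained policies (see the Research Type row above). For readers unfamiliar with the indicator, here is a minimal sketch of the two-objective case under a maximization convention; the function name, argument layout, and strict-dominance filter are illustrative assumptions, not the authors' implementation:

```python
def hypervolume_2d(points, ref):
    """Hypervolume of a 2-objective point set (maximization) w.r.t. reference point `ref`.

    Measures the area dominated by the point set and bounded below by `ref`;
    larger values indicate a better approximation of the Pareto front.
    """
    # Keep only points that strictly dominate the reference point.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep in decreasing order of the first objective, accumulating the
    # rectangle each point adds above the best second objective seen so far.
    pts.sort(key=lambda p: p[0], reverse=True)
    hv = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv
```

For example, the mutually non-dominated set {(1, 3), (2, 2), (3, 1)} with reference point (0, 0) covers an area of 6.0, while a dominated point such as (1, 1) alongside (2, 2) contributes nothing.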