Pareto Set Learning for Multi-Objective Reinforcement Learning

Authors: Erlong Liu, Yu-Chang Wu, Xiaobin Huang, Chengrui Gao, Ren-Jian Wang, Ke Xue, Chao Qian

AAAI 2025

Reproducibility assessment (variable: result — supporting LLM response):
Research Type: Experimental — "Through extensive experiments on diverse benchmarks, we demonstrate the effectiveness of PSL-MORL in achieving dense coverage of the Pareto front, significantly outperforming state-of-the-art MORL methods in the hypervolume and sparsity indicators."
Researcher Affiliation: Academia — National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; School of Artificial Intelligence, Nanjing University, Nanjing 210023, China
Pseudocode: Yes — Algorithm 1 (PSL-MORL).
  Input: preference distribution Λ; environment E; number E of episodes; number K of weights per episode; replay buffer D; batch size N.
  Output: hypernetwork parameters ϕ and primary policy network parameters θ1.
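The pseudocode above centers on a hypernetwork that maps a preference sampled from Λ to the parameters of a policy network. A minimal NumPy sketch of that parameter-generation step, under illustrative assumptions (the layer sizes, class name, and two-layer architecture are ours, not the paper's):

```python
import numpy as np

class PreferenceHypernetwork:
    """Maps a preference weight vector to the flat parameters of a
    small target policy network (illustrative sizes, not the paper's)."""

    def __init__(self, n_objectives, target_param_count, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        # phi: the hypernetwork's own parameters (two dense layers)
        self.w1 = rng.normal(0.0, 0.1, (n_objectives, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, target_param_count))
        self.b2 = np.zeros(target_param_count)

    def generate(self, preference):
        """preference: non-negative weights over objectives, summing to 1."""
        h = np.tanh(preference @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # flat theta for the policy network


# Schematic inner-loop step: sample a preference from the simplex,
# generate policy parameters, then act/update with the induced policy.
hyper = PreferenceHypernetwork(n_objectives=2, target_param_count=10)
pref = np.random.default_rng(1).dirichlet(np.ones(2))
theta = hyper.generate(pref)
assert theta.shape == (10,)
```

In this formulation the loop over K weights per episode would call `generate` once per sampled preference, so a single hypernetwork represents the whole preference-conditioned policy family.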
Open Source Code: No — The paper does not contain any explicit statement about code release, nor does it provide a link to a code repository.
Open Datasets: Yes — MO-MuJoCo is a popular MORL benchmark based on the MuJoCo physics simulator, consisting of several continuous control tasks. Experiments are conducted in five environments: MO-HalfCheetah-v2, MO-Hopper-v2, MO-Ant-v2, MO-Swimmer-v2, and MO-Walker-v2, all with two objectives. FTN is a discrete MORL benchmark with six objectives, whose goal is to navigate a tree to harvest fruit that optimizes six nutritional values under specific preferences.
Dataset Splits: No — The paper uses continuous simulation environments (MO-MuJoCo) and a discrete navigation task (FTN), in which agents interact with the environment rather than learning from static datasets with predefined train/validation/test splits. It mentions sampling preferences and using a replay buffer but specifies no dataset splits in the conventional sense.
Hardware Specification: No — The paper does not report hardware details such as GPU models, CPU types, or memory capacity; it describes only general experimental settings.
Software Dependencies: No — The paper names algorithms such as Double Deep Q-Network (DDQN) and Twin Delayed Deep Deterministic Policy Gradient (TD3), and techniques such as Hindsight Experience Replay (HER), but lists no software packages with version numbers (e.g., Python 3.x, PyTorch 1.x) that would be needed for replication.
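The DDQN/TD3-style updates mentioned here operate on a vector-valued reward scalarized by the sampled preference. A hedged sketch of that scalarization step, assuming the common linear-utility form (the paper's exact utility function is not given in this excerpt):

```python
import numpy as np

def scalarize(reward_vec, preference):
    """Linear scalarization of a multi-objective reward.

    reward_vec: per-objective rewards for one transition.
    preference: non-negative simplex weights over objectives.
    """
    reward_vec = np.asarray(reward_vec, dtype=float)
    preference = np.asarray(preference, dtype=float)
    assert np.isclose(preference.sum(), 1.0), "preference must sum to 1"
    return float(reward_vec @ preference)

# A two-objective step reward weighted toward the first objective:
scalarize([2.0, -1.0], [0.75, 0.25])  # → 1.25
```

With this utility, the standard single-objective Bellman targets of DDQN or TD3 apply unchanged to the scalarized reward, conditioned on the preference that produced it.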
Experiment Setup: No — The paper lists generic settings ("number E of episodes, number K of weights per episode, replay buffer D, batch size N" in Algorithm 1) and defers concrete hyperparameter values to appendices: "For the choice of α, we pick the value that owns the best performance in grid search experiments, as shown in Appendix A.6" and "The full details of the MLP model can be found in Appendix A in the full version." The main text states only that the reference point for hypervolume calculation is (0, 0) and that experiments are run with six different random seeds, which does not amount to a comprehensive setup description.
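The evaluation metrics referenced above (hypervolume with reference point (0, 0), and sparsity) can be computed as follows. This is a sketch under the common definitions used in the MORL literature; the paper's own metric implementations are deferred to its appendix:

```python
import numpy as np

def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Hypervolume of a 2-objective maximization front w.r.t. ref.

    points: (n, 2) array of mutually non-dominated objective vectors.
    Sweeps points by descending first objective, adding disjoint rectangles.
    """
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(-pts[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def sparsity(points):
    """Mean squared gap between consecutive solutions per objective
    (the sparsity metric commonly used in MORL; lower means denser)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if n < 2:
        return 0.0
    total = 0.0
    for j in range(pts.shape[1]):
        sorted_vals = np.sort(pts[:, j])
        total += np.sum(np.diff(sorted_vals) ** 2)
    return total / (n - 1)

front = [[3.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
hypervolume_2d(front)  # → 6.0 (rectangles 3·1 + 2·1 + 1·1)
```

A denser Pareto front approximation raises hypervolume and lowers sparsity, which is exactly the direction of improvement the paper claims for PSL-MORL.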