Efficient Discovery of Pareto Front for Multi-Objective Reinforcement Learning
Authors: Ruohong Liu, Yuxin Pan, Linjie Xu, Lei Song, Pengcheng You, Yize Chen, Jiang Bian
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, compared to recent advancements in MORL methods, our algorithm achieves more consistent and superior performances in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks, especially with numerous objectives (up to nine objectives in our experiments). Our code is available at https://github.com/RuohLiuq/C-MORL. |
| Researcher Affiliation | Collaboration | Ruohong Liu (University of Oxford, Oxford, UK); Yuxin Pan (The Hong Kong University of Science and Technology, Hong Kong, China); Linjie Xu (Queen Mary University of London, London, UK); Lei Song & Jiang Bian (Microsoft Research Asia, Beijing, China); Pengcheng You (Peking University, Beijing, China); Yize Chen (University of Alberta, Edmonton, Canada) |
| Pseudocode | Yes | Algorithm 1 Policy Selection Algorithm 2 C-MORL |
| Open Source Code | Yes | Our code is available at https://github.com/RuohLiuq/C-MORL. |
| Open Datasets | Yes | In this Section, we validate the design of our proposed algorithm using both popular discrete and continuous MORL benchmarks from MO-Gymnasium (Felten et al., 2023a) and SustainGym (Yeh et al., 2024). These benchmarks include five comprehensive domains: (i) Grid World includes Fruit Tree, a discrete benchmark with six objectives. (ii) Classic Control includes MO-Lunar-Lander, a discrete benchmark with four objectives. (iii) Miscellaneous includes Minecart, a discrete benchmark with four objectives. (iv) Robotics Control includes five MuJoCo tasks with continuous action space based on the MuJoCo simulator (Todorov et al., 2012; Xu et al., 2020). (v) Sustainable Energy Systems includes two building heating supply tasks. |
| Dataset Splits | No | The paper does not provide conventional training/validation/test splits, since it uses interactive RL environments rather than fixed datasets. Instead, it reports training duration per environment and the evaluation procedure over a sampled preference space. For example: "Each of the baselines are trained for 5 * 10^5 time steps for discrete benchmarks. Continuous benchmarks with two, three, and nine objectives are trained for 1.5 * 10^6, 2 * 10^6, and 2.5 * 10^6 steps, respectively." and "For metrics evaluation, we evenly generate an evaluation preference set in a systematic manner with specified intervals of 0.01, 0.1, and 0.5 for benchmarks with two objectives, three or four objectives, and six or nine objectives, respectively." These describe training interaction and evaluation strategy, not dataset partitioning. |
| Hardware Specification | Yes | We run all the experiments on a cloud server including CPU Intel Xeon Processor and GPU Tesla T4. |
| Software Dependencies | No | The paper mentions algorithms like PPO and uses libraries like MORL-baselines, but does not provide specific version numbers for software components (e.g., Python, PyTorch, or the MORL-baselines library itself). For example: "In the Pareto initialization stage, we use PPO algorithm implemented by Kostrikov (2018)." and "For Envelope (Yang et al., 2019), CAPQL (Lu et al., 2022), GPILS (Alegre et al., 2023), and MORL/D (Felten et al., 2024), we utilize the implementations available in the MORL-baselines library (Felten et al., 2023a), adapting them as necessary to align with our experimental setup." |
| Experiment Setup | Yes | The PPO parameters are reported in Table 7 and Table 8. For constrained optimization, we adopt C-MORL-IPO method. [...] The hyperparameters of C-MORL-IPO include: Number of initial policy M: the number of initial policies. [...] The parameters we used are provided in Table 9 and Table 10. |
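The evaluation protocol quoted in the Dataset Splits row enumerates preference vectors on the probability simplex at a fixed interval (e.g., every 0.01 for two objectives, every 0.1 for three or four). A minimal sketch of such a generator is below; the function name `preference_grid` and its exact signature are our illustration, not code from the paper's repository.

```python
def preference_grid(n_objectives, interval):
    """Enumerate preference vectors whose entries are non-negative
    multiples of `interval` and sum to 1 (points on the simplex).

    Works in integer steps internally to avoid floating-point drift
    when accumulating the interval."""
    steps = round(1.0 / interval)  # number of interval-sized increments in 1.0

    def compositions(remaining, dims):
        # Yield all ways to split `remaining` integer steps across `dims` slots.
        if dims == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for tail in compositions(remaining - k, dims - 1):
                yield (k,) + tail

    return [tuple(k * interval for k in c)
            for c in compositions(steps, n_objectives)]
```

For two objectives with interval 0.01 this yields 101 preference vectors ((0.00, 1.00) through (1.00, 0.00)); each resulting weight vector can then be used to scalarize the multi-objective return when computing metrics such as expected utility.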