Preference Controllable Reinforcement Learning with Advanced Multi-Objective Optimization
Authors: Yucheng Yang, Tianyi Zhou, Mykola Pechenizkiy, Meng Fang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PCRL with different MOO algorithms against state-of-the-art MORL baselines in various challenging environments with up to six objectives. In these experiments, our proposed method exhibits significantly better controllability than existing approaches and can generate Pareto solutions with better diversity and utilities. We conducted experiments in environments with conflicting objectives (Felten et al., 2023) to empirically demonstrate that (1) our PCRL scheme is compatible with various MOO methods; and (2) PCRL with PreCo consistently achieves superior performance across multiple MORL environments. |
| Researcher Affiliation | Academia | 1Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands 2Department of Computer Science, University of Maryland, College Park, Maryland, The United States 3Department of Computer Science, University of Liverpool, Liverpool, The United Kingdom. |
| Pseudocode | Yes | Algorithm 1 PreCo in the theoretical analysis setting... Algorithm 2 PCRL with PreCo update |
| Open Source Code | No | The paper mentions "Our implementation of PDMORL using its official GitHub codebase" when discussing baselines, but it does not provide any explicit statement or link for the open-source code of the methodology described in this paper (PCRL with PreCo). |
| Open Datasets | Yes | Benchmarking Environments Common continuous control environments like MO-Hopper and MO-Ant (Felten et al., 2023) feature higher-dimensional spaces but symmetric objectives... Fruit-Tree: A discrete environment... MO-Reacher: A robotic control environment... |
| Dataset Splits | Yes | The test preferences are p ∈ P with a resolution of 0.1 for each dimension. For instance, in 3-D cases, these preferences include [0, 0, 1], [0, 0.1, 0.9], . . . , [0, 1, 0], [0.1, 0, 0.9], . . . , [0.9, 0.1, 0], [1, 0, 0], with a quantity of 66. There are 286 test preferences for 4-D, 1001 for 5-D, and 3003 for 6-D. During training, the preferences were sampled uniformly from the convex coefficient set P... |
| Hardware Specification | Yes | It is computationally inefficient, requiring 14 hours to complete training in the Fruit-Tree environment and over 70 hours for MO-Hopper, even when using an NVIDIA A100 GPU and 72 CPU cores. |
| Software Dependencies | No | The paper mentions the use of "Adam optimizer (Kingma & Ba, 2015)" but does not provide specific version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Tables 8, 10, and 12, titled "Hyper-parameters settings MO-Ant", "Hyper-parameters settings MO-Hopper", and "Hyper-parameters settings Fruit-tree" respectively, provide detailed experimental setup information including Discount (γ), Optimizer, Learning rate, Number of hidden layers, Number of hidden units, Activation function, Batch size, Buffer Size, Starting timesteps, Gradient clipping, Exploration method, Noise distribution, Noise clipping limit, Policy frequency, Target network update rate, Maximum episode timesteps, and Evaluation episodes. |
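The test-preference counts reported above (66 for 3-D, 286 for 4-D, 1001 for 5-D, 3003 for 6-D) are exactly the number of grid points on the preference simplex at resolution 0.1, given by the stars-and-bars formula C(steps + dim − 1, dim − 1) with steps = 10. A minimal sketch reproducing those counts, using a hypothetical `simplex_grid` helper rather than the authors' code:

```python
from math import comb

def simplex_grid(dim, steps=10):
    """Yield all preference vectors whose entries are multiples of
    1/steps and sum to 1 (a discretized convex-coefficient set P)."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in rec(remaining - k, slots - 1):
                yield (k,) + rest
    for combo in rec(steps, dim):
        yield tuple(k / steps for k in combo)

# The enumerated counts match stars-and-bars: C(steps + dim - 1, dim - 1).
for d in (3, 4, 5, 6):
    assert sum(1 for _ in simplex_grid(d)) == comb(10 + d - 1, d - 1)

print([sum(1 for _ in simplex_grid(d)) for d in (3, 4, 5, 6)])
# → [66, 286, 1001, 3003]
```

This confirms the paper's stated test-set sizes follow directly from the 0.1 resolution, independent of any environment details.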