Preference Controllable Reinforcement Learning with Advanced Multi-Objective Optimization
Authors: Yucheng Yang, Tianyi Zhou, Mykola Pechenizkiy, Meng Fang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PCRL with different MOO algorithms against state-of-the-art MORL baselines in various challenging environments with up to six objectives. In these experiments, our proposed method exhibits significantly better controllability than existing approaches and can generate Pareto solutions with better diversity and utilities. We conducted experiments in environments with conflicting objectives (Felten et al., 2023) to empirically demonstrate that (1) our PCRL scheme is compatible with various MOO methods; and (2) PCRL with PreCo consistently achieves superior performance across multiple MORL environments. |
| Researcher Affiliation | Academia | 1Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands 2Department of Computer Science, University of Maryland, College Park, Maryland, The United States 3Department of Computer Science, University of Liverpool, Liverpool, The United Kingdom. |
| Pseudocode | Yes | Algorithm 1 PreCo in the theoretical analysis setting... Algorithm 2 PCRL with PreCo update |
| Open Source Code | No | The paper mentions "Our implementation of PDMORL using its official GitHub codebase" when discussing baselines, but it does not provide any explicit statement or link for the open-source code of the methodology described in this paper (PCRL with PreCo). |
| Open Datasets | Yes | Benchmarking Environments Common continuous control environments like MO-Hopper and MO-Ant (Felten et al., 2023) feature higher-dimensional spaces but symmetric objectives... Fruit-Tree: A discrete environment... MO-Reacher: A robotic control environment... |
| Dataset Splits | Yes | The test preferences are p ∈ P with a resolution of 0.1 for each dimension. For instance, in 3-D cases, these preferences include [0, 0, 1], [0, 0.1, 0.9], . . . , [0, 1, 0], [0.1, 0, 0.9], . . . , [0.9, 0.1, 0], [1, 0, 0], with a quantity of 66. There are 286 test preferences for 4-D, 1001 for 5-D, and 3003 for 6-D. During training, the preferences were sampled uniformly from the convex coefficient set P... |
| Hardware Specification | Yes | It is computationally inefficient, requiring 14 hours to complete training in the Fruit-Tree environment and over 70 hours for MO-Hopper, even when using an NVIDIA A100 GPU and 72 CPU cores. |
| Software Dependencies | No | The paper mentions the use of "Adam optimizer (Kingma & Ba, 2015)" but does not provide specific version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Tables 8, 10, and 12, titled "Hyper-parameters settings MO-Ant", "Hyper-parameters settings MO-Hopper", and "Hyper-parameters settings Fruit-tree" respectively, provide detailed experimental setup information including Discount (γ), Optimizer, Learning rate, Number of hidden layers, Number of hidden units, Activation function, Batch size, Buffer Size, Starting timesteps, Gradient clipping, Exploration method, Noise distribution, Noise clipping limit, Policy frequency, Target network update rate, Maximum episode timesteps, and Evaluation episodes. |
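The test-preference counts reported above (66 for 3-D, 286 for 4-D, 1001 for 5-D, 3003 for 6-D) are exactly the number of grid points on the preference simplex at resolution 0.1, given by the stars-and-bars formula C(steps + dim − 1, dim − 1) with steps = 10. A minimal sketch reproducing those counts, using a hypothetical `simplex_grid` helper rather than the authors' code:

```python
from math import comb

def simplex_grid(dim, steps=10):
    """Yield all preference vectors whose entries are multiples of
    1/steps and sum to 1 (a discretized convex-coefficient set P)."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in rec(remaining - k, slots - 1):
                yield (k,) + rest
    for combo in rec(steps, dim):
        yield tuple(k / steps for k in combo)

# The enumerated counts match stars-and-bars: C(steps + dim - 1, dim - 1).
for d in (3, 4, 5, 6):
    assert sum(1 for _ in simplex_grid(d)) == comb(10 + d - 1, d - 1)

print([sum(1 for _ in simplex_grid(d)) for d in (3, 4, 5, 6)])
# → [66, 286, 1001, 3003]
```

This confirms the paper's stated test-set sizes follow directly from the 0.1 resolution, independent of any environment details.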