Demonstration-Guided Multi-Objective Reinforcement Learning

Authors: Junlin Lu, Patrick Mannion, Karl Mason

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical studies demonstrate DG-MORL's superiority over state-of-the-art MORL algorithms, establishing its robustness and efficacy." From Section 5 (Experiments): "In this section, we introduce the baselines, benchmark environments and metrics. We then illustrate and discuss the results."
Researcher Affiliation | Academia | Junlin Lu [EMAIL], Patrick Mannion [EMAIL], Karl Mason [EMAIL], School of Computer Science, University of Galway, Galway, Ireland, H91 TK33
Pseudocode | Yes | "Algorithm 1: DG-MORL Algorithm"
Open Source Code | Yes | "The code of this work is available on https://github.com/MORL12345/DG-MORL.git."
Open Datasets | Yes | "We conduct the evaluation on MORL tasks with escalating complexity: from an instance featuring discrete states and actions, i.e. Deep Sea Treasure (DST) (Yang et al., 2019; Alegre et al., 2023), to tasks with continuous states and discrete actions, i.e. Minecart (Abels et al., 2019; Alegre et al., 2023). We also test it in control tasks with both continuous states and actions, i.e. MO-Hopper (Basaklar et al., 2023; Alegre et al., 2023), as well as MO-Ant and MO-Humanoid, which are extensions of the MuJoCo continuous control tasks from OpenAI Gym (Borsa et al., 2019)."
Dataset Splits | No | The paper mentions experimental runs with randomly picked seeds and varying numbers of initial demonstrations, but does not specify how a static dataset (if used) is split into training, validation, or test sets by percentages or counts.
Hardware Specification | Yes | "The experiments are run on a machine with a 12th Gen Intel(R) Core(TM) i9-12900 CPU, an NVIDIA T1000 graphics card and 32 GB memory."
Software Dependencies | No | The paper mentions using the Adam optimizer and the pycddlib library, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "We use the same neural network architecture as the GPI-PD implementation in all benchmarks, i.e. 4 layers of 256 neurons in DST and Minecart, and 2 layers of 256 neurons for both the critic and actor in MO-Hopper, MO-Ant and MO-Humanoid. We use the Adam optimizer; the learning rate is 3e-4 and the batch size is 128 in all implementations. For exploration, we adopt the same annealed epsilon-greedy strategy: in DST, epsilon anneals from 1 to 0 over the first 50000 time steps; in Minecart, epsilon anneals from 1 to 0.05 over the first 50000 time steps. For the TD3 algorithm used in MO-Hopper, MO-Ant and MO-Humanoid, we add zero-mean Gaussian noise with a standard deviation of 0.02 to actions from the actor network. All hyper-parameters are consistent with the literature (Alegre et al., 2023)."
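The exploration setup quoted above can be sketched in plain Python. This is a minimal illustration only: the function names are hypothetical, and the linear form of the annealing is an assumption, since the paper states only the start/end epsilon values and the step counts.

```python
import random

def annealed_epsilon(step, eps_start=1.0, eps_end=0.0, anneal_steps=50_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps,
    then hold at eps_end. Paper's schedules: DST 1 -> 0, Minecart 1 -> 0.05,
    both over the first 50000 time steps (linear shape is an assumption)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def noisy_action(action, sigma=0.02):
    """TD3-style exploration: add zero-mean Gaussian noise (std 0.02, as in
    the paper) to each action dimension produced by the actor network."""
    return [a + random.gauss(0.0, sigma) for a in action]
```

For example, `annealed_epsilon(25_000)` gives 0.5 under the DST schedule, and passing `eps_end=0.05` reproduces the Minecart schedule.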