Demonstration-Guided Multi-Objective Reinforcement Learning

Authors: Junlin Lu, Patrick Mannion, Karl Mason

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical studies demonstrate DG-MORL's superiority over state-of-the-art MORL algorithms, establishing its robustness and efficacy." From Section 5 (Experiments): "In this section, we introduce the baselines, benchmark environments and metrics. We then illustrate and discuss the results."
Researcher Affiliation | Academia | Junlin Lu [EMAIL], Patrick Mannion [EMAIL], Karl Mason [EMAIL], School of Computer Science, University of Galway, Galway, Ireland, H91 TK33
Pseudocode | Yes | "Algorithm 1: DG-MORL Algorithm"
Open Source Code | Yes | "The code of this work is available on https://github.com/MORL12345/DG-MORL.git."
Open Datasets | Yes | "We conduct the evaluation on MORL tasks with escalating complexity: from an instance featuring discrete states and actions, i.e. Deep Sea Treasure (DST) (Yang et al., 2019; Alegre et al., 2023), to tasks with continuous states and discrete actions, i.e. Minecart (Abels et al., 2019; Alegre et al., 2023). We also test it in control tasks with both continuous states and actions, i.e. MO-Hopper (Basaklar et al., 2023; Alegre et al., 2023), as well as MO-Ant and MO-Humanoid, which are extensions of the MuJoCo continuous control tasks from OpenAI Gym (Borsa et al., 2019)."
Dataset Splits | No | The paper mentions experimental runs with randomly picked seeds and varying numbers of initial demonstrations, but does not specify how a static dataset (if used) is split into training, validation, or test sets by percentages or counts.
Hardware Specification | Yes | "The experiments are run on a machine with a 12th Gen Intel(R) Core(TM) i9-12900 CPU, an NVIDIA T1000 graphics card and 32 GB memory."
Software Dependencies | No | The paper mentions using the Adam optimizer and the pycddlib library, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "We use the same neural network architecture as the GPI-PD implementation in all benchmarks, i.e. 4 layers of 256 neurons in DST and Minecart, and 2 layers of 256 neurons for both the critic and actor in MO-Hopper, MO-Ant and MO-Humanoid. We use the Adam optimizer; the learning rate is 3e-4 and the batch size is 128 in all implementations. For exploration, we adopt the same annealed epsilon-greedy strategy: in DST, epsilon anneals from 1 to 0 over the first 50000 time steps; in Minecart, epsilon anneals from 1 to 0.05 over the first 50000 time steps. For the TD3 algorithm used in MO-Hopper, MO-Ant and MO-Humanoid, we add zero-mean Gaussian noise with a standard deviation of 0.02 to actions from the actor network. All hyper-parameters are consistent with the literature (Alegre et al., 2023)."
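The exploration setup quoted above can be sketched in plain Python. This is a minimal illustration only: the function names are hypothetical, and the linear form of the annealing is an assumption, since the paper states only the start/end epsilon values and the step counts.

```python
import random

def annealed_epsilon(step, eps_start=1.0, eps_end=0.0, anneal_steps=50_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps,
    then hold at eps_end. Paper's schedules: DST 1 -> 0, Minecart 1 -> 0.05,
    both over the first 50000 time steps (linear shape is an assumption)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def noisy_action(action, sigma=0.02):
    """TD3-style exploration: add zero-mean Gaussian noise (std 0.02, as in
    the paper) to each action dimension produced by the actor network."""
    return [a + random.gauss(0.0, sigma) for a in action]
```

For example, `annealed_epsilon(25_000)` gives 0.5 under the DST schedule, and passing `eps_end=0.05` reproduces the Minecart schedule.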