Demonstration-Guided Multi-Objective Reinforcement Learning
Authors: Junlin Lu, Patrick Mannion, Karl Mason
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical studies demonstrate DG-MORL's superiority over state-of-the-art MORL algorithms, establishing its robustness and efficacy. [Section 5, Experiments] In this section, we introduce the baselines, benchmark environments and metrics. We then illustrate and discuss the results. |
| Researcher Affiliation | Academia | Junlin Lu (EMAIL), Patrick Mannion (EMAIL), Karl Mason (EMAIL), School of Computer Science, University of Galway, Galway, Ireland, H91 TK33 |
| Pseudocode | Yes | Algorithm 1 DG-MORL Algorithm |
| Open Source Code | Yes | The code of this work is available on https://github.com/MORL12345/DG-MORL.git. |
| Open Datasets | Yes | We conduct the evaluation on MORL tasks with escalating complexity: from an instance featuring discrete states and actions, i.e. Deep Sea Treasure (DST) (Yang et al., 2019; Alegre et al., 2023), to tasks with continuous states and discrete actions, i.e. Minecart (Abels et al., 2019; Alegre et al., 2023). We also test it in control tasks with both continuous states and actions, i.e. MO-Hopper (Basaklar et al., 2023; Alegre et al., 2023), MO-Ant, and MO-Humanoid, which are extensions of the MuJoCo continuous control tasks from OpenAI Gym (Borsa et al., 2019). |
| Dataset Splits | No | The paper mentions experimental runs with randomly picked seeds and varying numbers of initial demonstrations but does not specify how a static dataset (if used) is split into training, validation, or test sets by percentages or counts. |
| Hardware Specification | Yes | The experiments are run on a machine with 12th Gen Intel(R) Core(TM) i9-12900 CPU, NVIDIA T1000 graphic card and 32 GB memory. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and the 'pycddlib library' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use the same neural network architecture as the GPI-PD implementation in all benchmarks, i.e. 4 layers of 256 neurons in DST and Minecart, and 2 layers of 256 neurons for both the critic and actor in MO-Hopper, MO-Ant and MO-Humanoid. We use the Adam optimizer with a learning rate of 3e-4 and a batch size of 128 in all implementations. For exploration, we adopt the same epsilon-greedy strategy with an annealing schedule: in DST, epsilon anneals from 1 to 0 during the first 50000 time steps; in Minecart, epsilon anneals from 1 to 0.05 over the first 50000 time steps. For the TD3 algorithm used in MO-Hopper, MO-Ant and MO-Humanoid, we add zero-mean Gaussian noise with a standard deviation of 0.02 to actions from the actor network. All hyper-parameters are consistent with the literature (Alegre et al., 2023). |
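The exploration setup quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the annealing is linear (the excerpt only gives the start/end values and horizon), and the function names `annealed_epsilon` and `noisy_action` are illustrative.

```python
import random


def annealed_epsilon(step, start=1.0, end=0.0, anneal_steps=50_000):
    """Linearly anneal epsilon from `start` to `end` over `anneal_steps`.

    Matches the quoted schedules under a linear-annealing assumption:
    DST anneals 1 -> 0 and Minecart 1 -> 0.05, both over 50,000 steps.
    """
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)


def noisy_action(action, sigma=0.02):
    """TD3-style exploration: add zero-mean Gaussian noise (std 0.02,
    per the excerpt) to each action dimension from the actor network."""
    return [a + random.gauss(0.0, sigma) for a in action]


# DST schedule: epsilon goes from 1 at step 0 to 0 at step 50,000.
assert annealed_epsilon(0) == 1.0
assert annealed_epsilon(50_000) == 0.0

# Minecart schedule halfway point: 1 -> 0.05 gives 0.525 at step 25,000.
print(annealed_epsilon(25_000, end=0.05))  # → 0.525
```

Linear annealing is the common default for such schedules, but the paper does not state the interpolation explicitly, so treat the shape of the curve as an assumption.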