Mildly Constrained Evaluation Policy for Offline Reinforcement Learning
Authors: Linjie Xu, Zhengyao Jiang, Jinyu Wang, Lei Song, Jiang Bian
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical results on D4RL MuJoCo locomotion, the high-dimensional Humanoid task, and a set of 16 robotic manipulation tasks show that MCEP brings significant performance improvements to classic offline RL methods and can further improve SOTA methods. |
| Researcher Affiliation | Collaboration | Linjie Xu (Queen Mary University of London); Zhengyao Jiang (University College London); Jinyu Wang, Lei Song, and Jiang Bian (Microsoft Research Asia) |
| Pseudocode | Yes | The overall algorithm is shown as pseudo-code (Alg. 1). At each step, the critic Q^{π_ψ}, the target policy π_ψ, and the evaluation policy π^e_ϕ are updated iteratively. |
| Open Source Code | Yes | The code is open-sourced at https://github.com/egg-west/MCEP.git. |
| Open Datasets | Yes | Environments: D4RL (Fu et al., 2020) is an offline RL benchmark consisting of many task sets. The experiments select 3 versions of MuJoCo locomotion (-v2) datasets... Finally, a set of 16 complex robotic manipulation tasks from Hussing et al. (2023) is considered. |
| Dataset Splits | No | The paper uses pre-collected datasets (D4RL, Humanoid, Robotic Manipulation) for offline learning but does not specify explicit training/validation/test splits. Since offline RL training does not depend on the environment, the entire dataset is used to learn the policy, and all reported results (except the training-process visualizations) come from evaluating the learned policy in the environment where the data was collected. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The implementations of TD3BC, TD3BC-MCEP, AWAC, and AWAC-MCEP are based on the framework of Kostrikov (2022), and all re-implemented/implemented methods use clipped double Q-learning (Fujimoto et al., 2018). However, the paper does not provide version numbers for software dependencies such as JAX, Python, or other libraries. |
| Experiment Setup | Yes | The full list of hyper-parameters can be found in Section A.1 and Table 2, and the final selected hyper-parameters are listed in Table 4. |