Mildly Constrained Evaluation Policy for Offline Reinforcement Learning
Authors: Linjie Xu, Zhengyao Jiang, Jinyu Wang, Lei Song, Jiang Bian
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical results on D4RL MuJoCo locomotion, the high-dimensional Humanoid task, and a set of 16 robotic manipulation tasks show that MCEP brings significant performance improvements to classic offline RL methods and can further improve SOTA methods. |
| Researcher Affiliation | Collaboration | Linjie Xu (Queen Mary University of London); Zhengyao Jiang (University College London); Jinyu Wang, Lei Song, and Jiang Bian (Microsoft Research Asia) |
| Pseudocode | Yes | The overall algorithm is shown as pseudo-code (Alg. 1). At each step, the critic Q^{π_ψ}, the target policy π_ψ, and the evaluation policy π^e_ϕ are updated iteratively. |
| Open Source Code | Yes | The code is open-sourced at https://github.com/egg-west/MCEP.git. |
| Open Datasets | Yes | Environments: D4RL (Fu et al., 2020) is an offline RL benchmark consisting of many task sets. The experiments select 3 versions of MuJoCo locomotion (-v2) datasets... Finally, a set of 16 complex robotic manipulation tasks from Hussing et al. (2023) is considered. |
| Dataset Splits | No | The paper uses pre-collected datasets (D4RL, Humanoid, Robotic Manipulation) for offline learning but does not specify explicit training/validation/test splits. Since offline RL training does not depend on the environment, the entire dataset is used to learn the policy, and all reported results (except the training-process visualizations) come from evaluating the learned policy in the environment where the data was collected. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The implementations of TD3BC, TD3BC-MCEP, AWAC, and AWAC-MCEP are based on the framework of Kostrikov (2022), and all re-implemented/implemented methods use clipped double Q-learning (Fujimoto et al., 2018). However, the paper does not provide version numbers for software dependencies such as JAX, Python, or other libraries. |
| Experiment Setup | Yes | The full list of hyper-parameters can be found in Section A.1 and Table 2, and the final selected hyper-parameters are listed in Table 4. |