GUARD: A Safe Reinforcement Learning Benchmark
Authors: Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Ruixuan Liu, Tianhao Wei, Changliu Liu
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comparison of state-of-the-art on-policy safe RL algorithms in various task settings using GUARD and establish baselines that future work can build on. In GUARD experiments, our objective is to assess the performance of safe RL algorithms across a diverse range of benchmark testing suites. These suites are meticulously designed, incorporating all available robot options as detailed in Section 5.1 and all task options outlined in Section 5.2. Additionally, we offer seamless integration of various constraint options into these benchmark testing suites, allowing users to select desired constraint types, numbers, sizes, and other parameters. Considering the diversity in robots, tasks, constraint types, and difficulty levels, we have curated 72 test suites. |
| Researcher Affiliation | Academia | Weiye Zhao EMAIL Robotics Institute Carnegie Mellon University Yifan Sun EMAIL Robotics Institute Carnegie Mellon University Feihan Li EMAIL Robotics Institute Carnegie Mellon University Rui Chen EMAIL Robotics Institute Carnegie Mellon University Ruixuan Liu EMAIL Robotics Institute Carnegie Mellon University Tianhao Wei EMAIL Robotics Institute Carnegie Mellon University Changliu Liu EMAIL Robotics Institute Carnegie Mellon University |
| Pseudocode | No | The paper describes algorithms such as CPO, PCPO, TRPO-Lagrangian, TRPO-FAC, TRPO-IPO, Safety Layer, and USL using mathematical formulations and descriptions of their logic (e.g., equations 2-7). However, there are no explicitly labeled pseudocode blocks or algorithm figures presenting structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | The code is available on GitHub: https://github.com/intelligent-control-lab/guard |
| Open Datasets | No | The paper introduces GUARD as a benchmark, which includes a testing suite with robot options, task options, and constraint options (Sections 5 and 6.1). While the code for this benchmark is open-source, these are environments that generate data during experiments, rather than pre-existing, externally provided datasets with specific access information (links, DOIs, or citations to data repositories). The paper does not provide concrete access information for a traditionally understood 'dataset' used in its experiments, beyond the environments defined within its own open-source code. |
| Dataset Splits | No | The paper describes the GUARD testing suites as environments in which agents collect experience online during training; no pre-existing dataset is used, so no train/validation/test splits are reported. |
| Hardware Specification | Yes | Each model is trained on a server with a 48-core Intel(R) Xeon(R) Silver 4214 CPU @ 2.2 GHz, an Nvidia RTX A4000 GPU with 16GB memory, and Ubuntu 20.04. |
| Software Dependencies | No | The paper states that GUARD is implemented in PyTorch, that networks are trained using the Adam optimizer, and that Ubuntu 20.04 is the operating system. However, it does not specify version numbers for key software components such as PyTorch or Python, which are required for full reproducibility. |
| Experiment Setup | Yes | In GUARD experiments, our objective is to assess the performance of safe RL algorithms across a diverse range of benchmark testing suites. These suites are meticulously designed, incorporating all available robot options as detailed in Section 5.1 and all task options outlined in Section 5.2. Additionally, we offer seamless integration of various constraint options into these benchmark testing suites, allowing users to select desired constraint types, numbers, sizes, and other parameters. The hyper-parameters used in our experiments are listed in Table 19 as default. Our experiments use separate multilayer perceptrons with tanh activations for the policy network, value network, and cost network. Each network consists of two hidden layers of size (64, 64). All of the networks are trained using the Adam optimizer with a learning rate of 0.01. During each epoch, the agent interacts B times with the environment and then performs a policy update based on the experience collected from the current epoch. The maximum length of the trajectory is set to 1000, and the total epoch number N is set to 200 by default. For all experiments, we use a discount factor of γ = 0.99, an advantage discount factor λ = 0.95, and a KL-divergence step size of δKL = 0.02. For experiments that consider cost constraints, we adopt a target cost δc = 0.0 to pursue a zero-violation policy. Other unique hyper-parameters for each algorithm are hand-tuned to attain reasonable performance. Further details are provided in Table 19. |
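The reported defaults can be summarized as a small configuration plus a network stub. The sketch below is framework-free (plain Python) for illustration only; the paper's actual implementation uses PyTorch, and only the numeric values copied from the setup description above are taken from the source. The `TanhMLP` class and its method names are hypothetical:

```python
import math
import random

# Default hyper-parameters as reported in the experiment setup (Table 19 defaults).
HYPERPARAMS = {
    "hidden_sizes": (64, 64),      # two hidden layers for policy/value/cost nets
    "activation": "tanh",
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "max_trajectory_length": 1000,
    "epochs": 200,                 # default N
    "discount_gamma": 0.99,        # γ
    "advantage_lambda": 0.95,      # λ (GAE)
    "kl_step_delta": 0.02,         # δKL
    "target_cost": 0.0,            # δc, zero-violation target
}


class TanhMLP:
    """Sketch of the (64, 64) tanh MLP described for the policy, value,
    and cost networks. Illustration only; not the paper's PyTorch code."""

    def __init__(self, in_dim, out_dim, hidden=(64, 64), seed=0):
        rng = random.Random(seed)
        sizes = (in_dim, *hidden, out_dim)
        # One (weights, biases) pair per layer, small uniform init.
        self.layers = [
            (
                [[rng.uniform(-0.1, 0.1) for _ in range(sizes[i])]
                 for _ in range(sizes[i + 1])],
                [0.0] * sizes[i + 1],
            )
            for i in range(len(sizes) - 1)
        ]

    def forward(self, x):
        for i, (w, b) in enumerate(self.layers):
            x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
                 for row, bi in zip(w, b)]
            if i < len(self.layers) - 1:   # tanh on hidden layers only
                x = [math.tanh(v) for v in x]
        return x

    def num_params(self):
        return sum(len(w) * len(w[0]) + len(b) for w, b in self.layers)
```

For a 10-dimensional observation and 2-dimensional action, such a network has 10·64 + 64 + 64·64 + 64 + 64·2 + 2 = 4994 parameters, which makes the (64, 64) architecture choice easy to sanity-check against a checkpoint.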