GUARD: A Safe Reinforcement Learning Benchmark
Authors: Weiye Zhao, Yifan Sun, Feihan Li, Rui Chen, Ruixuan Liu, Tianhao Wei, Changliu Liu
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comparison of state-of-the-art on-policy safe RL algorithms in various task settings using GUARD and establish baselines that future work can build on. In GUARD experiments, our objective is to assess the performance of safe RL algorithms across a diverse range of benchmark testing suites. These suites are meticulously designed, incorporating all available robot options as detailed in Section 5.1 and all task options outlined in Section 5.2. Additionally, we offer seamless integration of various constraint options into these benchmark testing suites, allowing users to select desired constraint types, numbers, sizes, and other parameters. Considering the diversity in robots, tasks, constraint types, and difficulty levels, we have curated 72 test suites. |
| Researcher Affiliation | Academia | Weiye Zhao EMAIL Robotics Institute Carnegie Mellon University Yifan Sun EMAIL Robotics Institute Carnegie Mellon University Feihan Li EMAIL Robotics Institute Carnegie Mellon University Rui Chen EMAIL Robotics Institute Carnegie Mellon University Ruixuan Liu EMAIL Robotics Institute Carnegie Mellon University Tianhao Wei EMAIL Robotics Institute Carnegie Mellon University Changliu Liu EMAIL Robotics Institute Carnegie Mellon University |
| Pseudocode | No | The paper describes algorithms such as CPO, PCPO, TRPO-Lagrangian, TRPO-FAC, TRPO-IPO, Safety Layer, and USL using mathematical formulations and descriptions of their logic (e.g., equations 2-7). However, there are no explicitly labeled pseudocode blocks or algorithm figures presenting structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | The code is available on GitHub: https://github.com/intelligent-control-lab/guard |
| Open Datasets | No | The paper introduces GUARD as a benchmark, which includes a testing suite with robot options, task options, and constraint options (Sections 5 and 6.1). While the code for this benchmark is open-source, these are environments that generate data during experiments, rather than pre-existing, externally provided datasets with specific access information (links, DOIs, or citations to data repositories). The paper does not provide concrete access information for a traditionally understood 'dataset' used in its experiments, beyond the environments defined within its own open-source code. |
| Dataset Splits | No | The paper describes the GUARD testing suites as environments in which agents collect experience online during training; no pre-existing dataset is used, so no train/validation/test splits are reported. |
| Hardware Specification | Yes | Each model is trained on a server with a 48-core Intel(R) Xeon(R) Silver 4214 CPU @ 2.2 GHz, an Nvidia RTX A4000 GPU with 16GB memory, and Ubuntu 20.04. |
| Software Dependencies | No | The paper states that GUARD is implemented in PyTorch, that networks are trained using the Adam optimizer, and that Ubuntu 20.04 is the operating system. However, it does not specify version numbers for key software components such as PyTorch or Python, which are required for full reproducibility. |
| Experiment Setup | Yes | In GUARD experiments, our objective is to assess the performance of safe RL algorithms across a diverse range of benchmark testing suites. These suites are meticulously designed, incorporating all available robot options as detailed in Section 5.1 and all task options outlined in Section 5.2. Additionally, we offer seamless integration of various constraint options into these benchmark testing suites, allowing users to select desired constraint types, numbers, sizes, and other parameters. The hyper-parameters used in our experiments are listed in Table 19 as default. Our experiments use separate multilayer perceptrons with tanh activations for the policy network, value network, and cost network. Each network consists of two hidden layers of size (64, 64). All of the networks are trained using the Adam optimizer with a learning rate of 0.01. During each epoch, the agent interacts B times with the environment and then performs a policy update based on the experience collected from the current epoch. The maximum length of the trajectory is set to 1000, and the total epoch number N is set to 200 by default. For all experiments, we use a discount factor of γ = 0.99, an advantage discount factor λ = 0.95, and a KL-divergence step size of δKL = 0.02. For experiments that consider cost constraints, we adopt a target cost δc = 0.0 to pursue a zero-violation policy. Other unique hyper-parameters for each algorithm are hand-tuned to attain reasonable performance. Further details are provided in Table 19. |
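The reported defaults can be summarized as a small configuration plus a network stub. The sketch below is framework-free (plain Python) for illustration only; the paper's actual implementation uses PyTorch, and only the numeric values copied from the setup description above are taken from the source. The `TanhMLP` class and its method names are hypothetical:

```python
import math
import random

# Default hyper-parameters as reported in the experiment setup (Table 19 defaults).
HYPERPARAMS = {
    "hidden_sizes": (64, 64),      # two hidden layers for policy/value/cost nets
    "activation": "tanh",
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "max_trajectory_length": 1000,
    "epochs": 200,                 # default N
    "discount_gamma": 0.99,        # γ
    "advantage_lambda": 0.95,      # λ (GAE)
    "kl_step_delta": 0.02,         # δKL
    "target_cost": 0.0,            # δc, zero-violation target
}


class TanhMLP:
    """Sketch of the (64, 64) tanh MLP described for the policy, value,
    and cost networks. Illustration only; not the paper's PyTorch code."""

    def __init__(self, in_dim, out_dim, hidden=(64, 64), seed=0):
        rng = random.Random(seed)
        sizes = (in_dim, *hidden, out_dim)
        # One (weights, biases) pair per layer, small uniform init.
        self.layers = [
            (
                [[rng.uniform(-0.1, 0.1) for _ in range(sizes[i])]
                 for _ in range(sizes[i + 1])],
                [0.0] * sizes[i + 1],
            )
            for i in range(len(sizes) - 1)
        ]

    def forward(self, x):
        for i, (w, b) in enumerate(self.layers):
            x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
                 for row, bi in zip(w, b)]
            if i < len(self.layers) - 1:   # tanh on hidden layers only
                x = [math.tanh(v) for v in x]
        return x

    def num_params(self):
        return sum(len(w) * len(w[0]) + len(b) for w, b in self.layers)
```

For a 10-dimensional observation and 2-dimensional action, such a network has 10·64 + 64 + 64·64 + 64 + 64·2 + 2 = 4994 parameters, which makes the (64, 64) architecture choice easy to sanity-check against a checkpoint.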