Implicit Safe Set Algorithm for Provably Safe Reinforcement Learning
Authors: Weiye Zhao, Feihan Li, Tairan He, Changliu Liu
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark, where it achieves zero safety violations while gaining 95% ± 9% cumulative reward compared to state-of-the-art safe DRL methods. Furthermore, the resulting algorithm scales well to high-dimensional systems with parallel computing. ... Section 8: Experimental Results |
| Researcher Affiliation | Academia | WEIYE ZHAO, Carnegie Mellon University, United States; FEIHAN LI, Carnegie Mellon University, United States; TAIRAN HE, Carnegie Mellon University, United States; CHANGLIU LIU, Carnegie Mellon University, United States ... Carnegie Mellon University, Pittsburgh, Pennsylvania, United States |
| Pseudocode | Yes | Algorithm 1 Adaptive Momentum Boundary Approximation ... Algorithm 2 Implicit Safe Set Algorithm (ISSA) ... Algorithm 3 Convergence Trigger |
| Open Source Code | Yes | Our code is available on Github.1 1https://github.com/intelligent-control-lab/Implicit_Safe_Set_Algorithm |
| Open Datasets | Yes | We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark ... We adopt Safety Gym (Ray et al. 2019) as our testing platform to evaluate the effectiveness of the proposed implicit safe set algorithms. |
| Dataset Splits | No | The average episode return J_r and the average episodic sum of costs J_c were obtained by averaging over the last five epochs of training to reduce noise. The cost rate ρ_c was taken from the final epoch. We report the results of these three metrics in Table 4, normalized by PPO results. The paper mentions training and evaluating policies on the Safety Gym benchmark but does not explicitly specify how the data within Safety Gym was split into training, validation, or test sets for reproduction purposes. |
| Hardware Specification | No | Our experiments use MuJoCo's unicycle and quadruped models whose low-level velocity controllers meet this reachability property. ... The underlying dynamics of Safety Gym is directly handled by the MuJoCo physics simulator (Todorov et al. 2012). This indicates the dynamics is not explicitly accessible but rather can be implicitly evaluated, which is suitable for our proposed implicit safe set algorithm. No specific computational hardware (e.g., GPU, CPU models, memory) used for running the experiments is mentioned. |
| Software Dependencies | No | The paper names the algorithms PPO, PPO-Lagrangian, CPO, and PPO-SL, and the MuJoCo physics simulator, but does not provide specific version numbers for any software libraries, frameworks, or environments used. |
| Experiment Setup | Yes | Table 2 lists the important hyper-parameters of PPO, PPO-Lagrangian, CPO, PPO-SL, and PPO-ISSA: timesteps per iteration 30000; policy network hidden layers (256, 256); value network hidden layers (256, 256); policy learning rate 0.0004; value learning rate 0.001; target KL 0.01; discount factor γ 0.99; advantage discount factor λ 0.97; PPO clipping ε 0.2; TRPO conjugate gradient damping (N/A) 0.1; TRPO backtracking steps (N/A) 10; cost limit (N/A) 0. Safety index parameters: n = 1, k = 0.375, σ = 0 (constraint size 0.05); n = 1, k = 0.5, σ = 0 (constraint size 0.15). |
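For reproduction, the reported hyper-parameters can be collected into a plain configuration mapping. This is a minimal sketch: the key names are illustrative and are not taken from the authors' released codebase, only the values come from Table 2 of the paper.

```python
# Hyper-parameters reported in Table 2 (shared across PPO, PPO-Lagrangian,
# CPO, PPO-SL, and PPO-ISSA). Key names are illustrative, not the
# identifiers used in the released code.
training_config = {
    "timesteps_per_iteration": 30_000,
    "policy_hidden_layers": (256, 256),
    "value_hidden_layers": (256, 256),
    "policy_lr": 4e-4,
    "value_lr": 1e-3,
    "target_kl": 0.01,
    "gamma": 0.99,       # discount factor
    "lam": 0.97,         # advantage (GAE) discount factor
    "ppo_clip": 0.2,     # PPO clipping epsilon
    "cg_damping": 0.1,   # TRPO conjugate gradient damping (N/A for PPO variants)
    "backtrack_steps": 10,  # TRPO backtracking steps (N/A for PPO variants)
    "cost_limit": 0,
}

# Safety index parameters (n, k, sigma), keyed by constraint size.
safety_index_params = {
    0.05: {"n": 1, "k": 0.375, "sigma": 0},
    0.15: {"n": 1, "k": 0.5, "sigma": 0},
}
```

A dict like this makes it easy to diff a local run's settings against the paper's reported values before attempting to reproduce Table 4.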