Implicit Safe Set Algorithm for Provably Safe Reinforcement Learning

Authors: Weiye Zhao, Feihan Li, Tairan He, Changliu Liu

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark, where it achieves zero safety violations while gaining 95% ± 9% cumulative reward compared to state-of-the-art safe DRL methods. Furthermore, the resulting algorithm scales well to high-dimensional systems with parallel computing. ... 8 Experimental Results
Researcher Affiliation | Academia | Weiye Zhao, Carnegie Mellon University, United States; Feihan Li, Carnegie Mellon University, United States; Tairan He, Carnegie Mellon University, United States; Changliu Liu, Carnegie Mellon University, United States ... EMAIL, EMAIL, EMAIL, EMAIL, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States;
Pseudocode | Yes | Algorithm 1 Adaptive Momentum Boundary Approximation ... Algorithm 2 Implicit Safe Set Algorithm (ISSA) ... Algorithm 3 Convergence Trigger
Open Source Code | Yes | Our code is available on GitHub (footnote 1: https://github.com/intelligent-control-lab/Implicit_Safe_Set_Algorithm).
Open Datasets | Yes | We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark ... We adopt Safety Gym (Ray et al. 2019) as our testing platform to evaluate the effectiveness of the proposed implicit safe set algorithms.
Dataset Splits | No | The average episode return J_r and the average episodic sum of costs M_c were obtained by averaging over the last five epochs of training to reduce noise. Cost rate ρ_c was just taken from the final epoch. We report the results of these three metrics in Table 4 normalized by PPO results. The paper mentions training and evaluating policies on the Safety Gym benchmark but does not explicitly specify how the dataset within Safety Gym was split into training, validation, or test sets for reproduction purposes.
Hardware Specification | No | Our experiments use MuJoCo's unicycle and quadruped models whose low-level velocity controllers meet this reachability property. ... The underlying dynamics of Safety Gym is directly handled by MuJoCo physics simulator (Todorov et al. 2012). This indicates the dynamics is not explicitly accessible but rather can be implicitly evaluated, which is suitable for our proposed implicit safe set algorithm. No specific computational hardware (e.g., GPU, CPU models, memory) used for running the experiments is mentioned.
Software Dependencies | No | The paper mentions algorithm names like PPO, PPO-Lagrangian, CPO, and PPO-SL, and the physics simulator MuJoCo, but does not provide specific version numbers for any software libraries, frameworks, or environments used. For example, it lists
Experiment Setup | Yes | Table 2. Important hyper-parameters of PPO, PPO-Lagrangian, CPO, PPO-SL and PPO-ISSA: Timesteps per iteration 30000, Policy network hidden layers (256, 256), Value network hidden layers (256, 256), Policy learning rate 0.0004, Value learning rate 0.001, Target KL 0.01, Discounted factor γ 0.99, Advantage discounted factor λ 0.97, PPO Clipping ϵ 0.2, TRPO Conjugate gradient damping (N/A) 0.1, TRPO Backtracking steps (N/A) 10, Cost limit (N/A) 0. Safety index parameters (constraint size = 0.05 / constraint size = 0.15): n = 1 / 1, k = 0.375 / 0.5, η = 0 / 0.
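
For anyone attempting a reproduction, the setup values quoted in the table above can be collected into a small configuration object, together with the reporting convention quoted in the Dataset Splits row. The following is only a minimal sketch: every class, field, and function name is hypothetical and does not come from the authors' repository. The TRPO-specific entries and the cost limit are omitted because the flattened table does not make clear which algorithms they apply to. Normalization by PPO results is assumed here to be a plain ratio.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SharedHyperparams:
    """Values quoted from Table 2; the field names are illustrative only."""
    timesteps_per_iteration: int = 30_000
    policy_hidden_layers: Tuple[int, ...] = (256, 256)
    value_hidden_layers: Tuple[int, ...] = (256, 256)
    policy_learning_rate: float = 4e-4
    value_learning_rate: float = 1e-3
    target_kl: float = 0.01
    discount_gamma: float = 0.99
    advantage_lambda: float = 0.97
    ppo_clip_epsilon: float = 0.2


@dataclass(frozen=True)
class SafetyIndexParams:
    """Safety index parameters n, k, eta as listed in the table."""
    n: float
    k: float
    eta: float


# Keyed by the constraint size reported in the table.
SAFETY_INDEX_BY_CONSTRAINT_SIZE = {
    0.05: SafetyIndexParams(n=1, k=0.375, eta=0),
    0.15: SafetyIndexParams(n=1, k=0.5, eta=0),
}


def last_epochs_mean(per_epoch_values: Sequence[float], window: int = 5) -> float:
    """Average a metric over the final `window` training epochs."""
    if len(per_epoch_values) < window:
        raise ValueError("not enough epochs to average over")
    return mean(per_epoch_values[-window:])


def normalize_by_baseline(value: float, baseline: float) -> float:
    """Assumed normalization: ratio of a metric to the PPO baseline value."""
    return value / baseline
```

Under these assumptions, J_r and M_c would each be computed with `last_epochs_mean` and then passed through `normalize_by_baseline`, while ρ_c would use only the final entry of its per-epoch series before normalization.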