Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning
Authors: Somnath Hazra, Pallab Dasgupta, Soumyajit Dey
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms. |
| Researcher Affiliation | Collaboration | Somnath Hazra¹, Pallab Dasgupta², Soumyajit Dey¹; ¹Indian Institute of Technology Kharagpur, India; ²Synopsys, USA |
| Pseudocode | Yes | In Algorithm 1 below we outline the pseudocode for the policy update using our penalty function. Algorithm 1: Policy optimization using IP3O |
| Open Source Code | Yes | Code and Supplementary Material available here: https://github.com/somnathhazra/IP3O. |
| Open Datasets | Yes | The evaluations were conducted across three widely-used environments: MuJoCo Safety Velocity [Ji et al., 2023], Safety Gymnasium [Ray et al., 2019], and Bullet Safety Gymnasium [Gronauer, 2022]. These environments provide diverse challenges that test the agent's ability to maximize cumulative rewards while adhering to predefined safety constraints, as formulated in Equation 1. We also evaluate our approach on multi-agent environments, using the MetaDrive simulator [Li et al., 2022], which are detailed later. |
| Dataset Splits | No | The paper uses standard benchmark environments (MuJoCo Safety Velocity, Safety Gymnasium, Bullet Safety Gymnasium, MetaDrive simulator) for Reinforcement Learning. In these environments, data is typically generated through interaction, and the concept of fixed training/test/validation *dataset splits* as in supervised learning does not directly apply. The paper mentions evaluation over 10 episodes per evaluation step and training spanning 2000 episodes, but not specific data splits for a static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only discusses the environments and general experimental setup. |
| Software Dependencies | No | The paper mentions that "All baseline implementations are adapted from the OmniSafe repository [Ji et al., 2024]", which is a framework. However, it does not specify version numbers for OmniSafe itself or any other key software components like Python, PyTorch, TensorFlow, or CUDA, which are necessary for reproducibility. |
| Experiment Setup | Yes | For these experiments, we set the constraint limit to 25 and use α = 0.5. Detailed environment descriptions are provided in the Supplementary Material. As shown in Figure 4, IP3O obtains better returns while maintaining strict compliance with the velocity constraints compared to state-of-the-art algorithms. For these experiments, the constraint limit is 25, and we set α = 0.1. For our evaluations, we set α = 1.0 and a constraint limit of 25 across all scenarios. We conduct ablation studies to evaluate the effect of hyperparameters on the performance of our approach, focusing on two key parameters: the α hyperparameter and the cost limit d. |
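The quoted setup snippets report a fixed cost limit of 25 with a different α per experiment group. A minimal sketch of how those reported settings could be organized, assuming the mapping of α values to environment groups from the quoted order (0.5 for the main safety benchmarks, 0.1 for the velocity-constraint experiments, 1.0 for the multi-agent MetaDrive evaluations); the dictionary keys and `get_config` helper are illustrative names, not part of the IP3O codebase:

```python
# Illustrative reconstruction of the hyperparameter settings quoted in
# the Experiment Setup row. The suite names are hypothetical labels;
# the cost_limit and alpha values come from the paper's quoted text.
EXPERIMENT_CONFIGS = {
    "safety_benchmarks": {"cost_limit": 25, "alpha": 0.5},   # main benchmark environments
    "velocity_constraint": {"cost_limit": 25, "alpha": 0.1}, # velocity-constraint experiments
    "multi_agent_metadrive": {"cost_limit": 25, "alpha": 1.0},  # MetaDrive scenarios
}

def get_config(suite: str) -> dict:
    """Return the (illustrative) hyperparameters reported for a benchmark group."""
    return EXPERIMENT_CONFIGS[suite]
```

Collecting the settings in one place like this also makes the reported ablation axes explicit: every group shares d = 25, so α is the only parameter that varies across the quoted experiments.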