Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning

Authors: Somnath Hazra, Pallab Dasgupta, Soumyajit Dey

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms."
Researcher Affiliation | Collaboration | Somnath Hazra¹, Pallab Dasgupta², and Soumyajit Dey¹; ¹Indian Institute of Technology Kharagpur, India; ²Synopsys, USA
Pseudocode | Yes | "In Algorithm 1 below we outline the pseudocode for policy update using our penalty function." (Algorithm 1: Policy optimization using IP3O)
Open Source Code | Yes | "Code and Supplementary Material available here: https://github.com/somnathhazra/IP3O."
Open Datasets | Yes | "The evaluations were conducted across three widely-used environments: MuJoCo Safety Velocity [Ji et al., 2023], Safety Gymnasium [Ray et al., 2019], and Bullet Safety Gymnasium [Gronauer, 2022]. These environments provide diverse challenges that test the agent's ability to maximize cumulative rewards while adhering to predefined safety constraints, as formulated in Equation 1. We also evaluate our approach on multi-agent environments, using the MetaDrive simulator [Li et al., 2022], which are detailed later."
Dataset Splits | No | The paper uses standard benchmark environments (MuJoCo Safety Velocity, Safety Gymnasium, Bullet Safety Gymnasium, MetaDrive simulator) for reinforcement learning. In these environments, data is generated through interaction with the environment, so fixed training/validation/test dataset splits, as in supervised learning, do not directly apply. The paper mentions evaluation over 10 episodes per evaluation step and training spanning 2000 episodes, but no splits of a static dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run its experiments; it only describes the environments and the general experimental setup.
Software Dependencies | No | The paper states that "All baseline implementations are adapted from the OmniSafe repository [Ji et al., 2024]", which is a framework. However, it does not specify version numbers for OmniSafe itself or for other key software components such as Python, PyTorch, or CUDA, which are necessary for reproducibility.
Experiment Setup | Yes | The paper reports per-benchmark settings: "For these experiments, we set the constraint limit to 25 and use α = 0.5"; "For these experiments, the constraint limit is 25, and we set α = 0.1"; "For our evaluations, we set α = 1.0 and a constraint limit of 25 across all scenarios". It also notes: "Detailed environment descriptions are provided in the Supplementary Material"; "As shown in Figure 4, IP3O obtains better returns while maintaining strict compliance with the velocity constraints compared to state-of-the-art algorithms"; and "We conduct ablation studies to evaluate the effect of hyperparameters on the performance of our approach, focusing on two key parameters: the α hyper-parameter, and the cost limit d."
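To make the roles of the two reported hyperparameters concrete, here is a minimal sketch of a generic penalty-weighted objective for constrained RL, using the paper's hyperparameter names (α and the cost limit d). This is an illustrative assumption, not the actual IP3O penalty function; the function name `penalized_objective` and the specific penalty form are hypothetical.

```python
def penalized_objective(reward_return, cost_return, cost_limit=25.0, alpha=0.5):
    """Generic constrained-RL illustration (NOT the IP3O algorithm).

    reward_return: expected cumulative reward of the policy
    cost_return:   expected cumulative cost of the policy
    cost_limit:    the constraint limit d (the paper uses 25)
    alpha:         penalty weight (the paper uses 0.1, 0.5, or 1.0
                   depending on the benchmark)
    """
    # Penalize only the portion of the cost return that exceeds the limit d.
    violation = max(0.0, cost_return - cost_limit)
    return reward_return - alpha * violation
```

For example, a policy whose expected cost stays under the limit keeps its full reward (`penalized_objective(100.0, 20.0)` returns `100.0`), while one exceeding the limit by 10 with α = 0.5 is docked 5 (`penalized_objective(100.0, 35.0)` returns `95.0`). The ablations over α and d described above vary exactly these two knobs.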