Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning

Authors: Somnath Hazra, Pallab Dasgupta, Soumyajit Dey

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms."
Researcher Affiliation | Collaboration | Somnath Hazra¹, Pallab Dasgupta², and Soumyajit Dey¹; ¹Indian Institute of Technology Kharagpur, India; ²Synopsys, USA
Pseudocode | Yes | "In Algorithm 1 below we outline the pseudocode for policy update using our penalty function." (Algorithm 1: Policy optimization using IP3O)
Open Source Code | Yes | "Code and Supplementary Material available here: https://github.com/somnathhazra/IP3O."
Open Datasets | Yes | "The evaluations were conducted across three widely-used environments: MuJoCo Safety Velocity [Ji et al., 2023], Safety Gymnasium [Ray et al., 2019], and Bullet Safety Gymnasium [Gronauer, 2022]. These environments provide diverse challenges that test the agent's ability to maximize cumulative rewards while adhering to predefined safety constraints, as formulated in Equation 1. We also evaluate our approach on multi-agent environments, using the MetaDrive simulator [Li et al., 2022], which are detailed later."
Dataset Splits | No | The paper uses standard benchmark environments (MuJoCo Safety Velocity, Safety Gymnasium, Bullet Safety Gymnasium, MetaDrive simulator) for reinforcement learning. In these environments, data is generated through interaction with the environment, so fixed training/validation/test dataset splits, as in supervised learning, do not directly apply. The paper mentions evaluation over 10 episodes per evaluation step and training spanning 2000 episodes, but no splits of a static dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run its experiments; it only describes the environments and the general experimental setup.
Software Dependencies | No | The paper states that "All baseline implementations are adapted from the OmniSafe repository [Ji et al., 2024]", which is a framework. However, it does not specify version numbers for OmniSafe itself or for other key software components such as Python, PyTorch, or CUDA, which are necessary for reproducibility.
Experiment Setup | Yes | The paper reports per-benchmark settings: "For these experiments, we set the constraint limit to 25 and use α = 0.5"; "For these experiments, the constraint limit is 25, and we set α = 0.1"; "For our evaluations, we set α = 1.0 and a constraint limit of 25 across all scenarios". It also notes: "Detailed environment descriptions are provided in the Supplementary Material"; "As shown in Figure 4, IP3O obtains better returns while maintaining strict compliance with the velocity constraints compared to state-of-the-art algorithms"; and "We conduct ablation studies to evaluate the effect of hyperparameters on the performance of our approach, focusing on two key parameters: the α hyper-parameter, and the cost limit d."
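To make the roles of the two reported hyperparameters concrete, here is a minimal sketch of a generic penalty-weighted objective for constrained RL, using the paper's hyperparameter names (α and the cost limit d). This is an illustrative assumption, not the actual IP3O penalty function; the function name `penalized_objective` and the specific penalty form are hypothetical.

```python
def penalized_objective(reward_return, cost_return, cost_limit=25.0, alpha=0.5):
    """Generic constrained-RL illustration (NOT the IP3O algorithm).

    reward_return: expected cumulative reward of the policy
    cost_return:   expected cumulative cost of the policy
    cost_limit:    the constraint limit d (the paper uses 25)
    alpha:         penalty weight (the paper uses 0.1, 0.5, or 1.0
                   depending on the benchmark)
    """
    # Penalize only the portion of the cost return that exceeds the limit d.
    violation = max(0.0, cost_return - cost_limit)
    return reward_return - alpha * violation
```

For example, a policy whose expected cost stays under the limit keeps its full reward (`penalized_objective(100.0, 20.0)` returns `100.0`), while one exceeding the limit by 10 with α = 0.5 is docked 5 (`penalized_objective(100.0, 35.0)` returns `95.0`). The ablations over α and d described above vary exactly these two knobs.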