Probabilistic Shielding for Safe Reinforcement Learning
Authors: Edwin Hamel-De le Court, Francesco Belardinelli, Alexander W. Goodall
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The algorithm used to compute an inductive ϵ-upper bound of βM is Interval Iteration (Haddad and Monmege 2018), which is simple in our case as the end components of the MDPs corresponding to the environments are trivial. We use PPO (Schulman et al. 2017) as an RL algorithm to find an optimal policy in the shield. We demonstrate the viability of our approach with five case studies of increasing complexity. For each case study, we compare the safety and the cumulative reward given at each epoch by unshielded PPO (Schulman et al. 2017), PPO-shield (our approach), PPO-Lagrangian (Ray, Achiam, and Amodei 2019), a combination of a Lagrangian approach and PPO, and CPO (Achiam et al. 2017). We use Omnisafe (Ji et al. 2023b) for the implementation of PPO-Lagrangian and CPO. ... Figure 2 presents the results of our experiments. |
| Researcher Affiliation | Academia | Imperial College London |
| Pseudocode | Yes | Algorithm 1: Probabilistic Shielding |
| Open Source Code | No | The paper mentions using "Omnisafe (Ji et al. 2023b) for the implementation of PPO-Lagrangian and CPO." This refers to a third-party tool used by the authors, not a release of their own source code for the methodology described in the paper. No explicit statement of code release or a link to a repository for their specific implementation is provided. |
| Open Datasets | No | The paper describes custom environments for its experiments, such as "Media streaming", "Colour bomb gridworld v1", "Bridge crossing v1", and "Pacman". While it cites previous works for similar environments, it does not provide access information (links, DOIs, repositories, or formal citations) for the authors' own versions of these environments as publicly available artifacts. |
| Dataset Splits | No | The paper provides environment parameters like "episode length" and "total timesteps" in Table 1, which describe the duration of interactions within the simulated environments. However, it does not specify training, validation, or test splits for a static dataset, as is typical in supervised learning, because the data is generated dynamically through agent interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. General terms like "experiments" are used without hardware context. |
| Software Dependencies | No | The paper mentions using "PPO (Schulman et al. 2017)" as an RL algorithm and "Omnisafe (Ji et al. 2023b)" for implementing baselines. However, it does not provide version numbers for these or any other software dependencies, which would be needed to reproduce the software environment. |
| Experiment Setup | No | The paper provides environment-specific parameters like "random action probability", "episode length", "total timesteps", and "safety bound" in Table 1. However, it does not detail specific hyperparameters for the RL algorithms used, such as learning rates, batch sizes, number of epochs, or optimizer settings, which are critical for reproducing the experimental setup of the model's training. |
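To make the quoted methodology concrete: the review above notes that the paper computes an ϵ-upper bound on the reachability probability βM via Interval Iteration (Haddad and Monmege 2018), exploiting trivial end components, and then restricts the agent to a shield of safe actions. The sketch below is a hypothetical illustration of those two steps, not the authors' code; all function and variable names are invented, and terminal safe states are modeled explicitly so that both bounds converge.

```python
# Hypothetical sketch: Interval Iteration for an eps-upper bound on the
# probability beta(s) of ever reaching a bad state, plus a simple shield
# that masks actions whose one-step expected risk exceeds a bound delta.
# Assumes end components are trivial apart from absorbing terminal states,
# as the paper states for its environments.

def interval_iteration(states, actions, P, bad, terminal, eps=1e-6):
    """P[s][a] -> list of (next_state, prob) pairs.
    Returns hi, where hi[s] is an eps-upper bound on beta(s)."""
    lo = {s: 1.0 if s in bad else 0.0 for s in states}
    hi = {s: 0.0 if s in terminal else 1.0 for s in states}
    frozen = bad | terminal  # values fixed: bad -> 1, safe terminal -> 0
    while max(hi[s] - lo[s] for s in states) > eps:
        for bound in (lo, hi):
            new = {s: max(sum(p * bound[t] for t, p in P[s][a])
                          for a in actions[s])
                   for s in states if s not in frozen}
            bound.update(new)
    return hi

def shielded_actions(s, actions, P, beta, delta):
    """Actions whose expected next-state risk stays within delta;
    falls back to the single safest action if none qualifies."""
    risk = {a: sum(p * beta[t] for t, p in P[s][a]) for a in actions[s]}
    allowed = [a for a in actions[s] if risk[a] <= delta]
    return allowed if allowed else [min(risk, key=risk.get)]
```

In this picture, the RL algorithm (PPO in the paper) would sample its actions only from `shielded_actions`, which is one common way a probabilistic shield can enforce a safety bound during training.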