Probabilistic Shielding for Safe Reinforcement Learning
Authors: Edwin Hamel-De le Court, Francesco Belardinelli, Alexander W. Goodall
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The algorithm used to compute an inductive ϵ-upper bound of βM is Interval Iteration (Haddad and Monmege 2018), which is simple in our case as the end components of the MDPs corresponding to the environments are trivial. We use PPO (Schulman et al. 2017) as an RL algorithm to find an optimal policy in the shield. We demonstrate the viability of our approach with five case studies of increasing complexity. For each case study, we compare the safety and the cumulative reward given at each epoch by unshielded PPO (Schulman et al. 2017), PPO-shield (our approach), PPO-Lagrangian (Ray, Achiam, and Amodei 2019), a combination of a Lagrangian approach and PPO, and CPO (Achiam et al. 2017). We use Omnisafe (Ji et al. 2023b) for the implementation of PPO-Lagrangian and CPO. ... Figure 2 presents the results of our experiments. |
| Researcher Affiliation | Academia | Imperial College London |
| Pseudocode | Yes | Algorithm 1: Probabilistic Shielding |
| Open Source Code | No | The paper mentions using "Omnisafe (Ji et al. 2023b) for the implementation of PPO-Lagrangian and CPO." This refers to a third-party tool used by the authors, not a release of their own source code for the methodology described in the paper. No explicit statement of code release or a link to a repository for their specific implementation is provided. |
| Open Datasets | No | The paper describes custom environments for its experiments, such as "Media streaming", "Colour bomb gridworld v1", "Bridge crossing v1", and "Pacman". While it cites previous works for similar environments, it does not provide access information (links, DOIs, repositories, or formal citations) for the authors' own versions of these environments as publicly available artifacts. |
| Dataset Splits | No | The paper provides environment parameters like "episode length" and "total timesteps" in Table 1, which describe the duration of interactions within the simulated environments. However, it does not specify training, validation, or test splits for a static dataset, as is typical in supervised learning, because the data is generated dynamically through agent interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. General terms like "experiments" are used without hardware context. |
| Software Dependencies | No | The paper mentions using "PPO (Schulman et al. 2017)" as an RL algorithm and "Omnisafe (Ji et al. 2023b)" for implementing baselines. However, it does not provide version numbers for these or any other software dependencies, which would be needed to reproduce the software environment. |
| Experiment Setup | No | The paper provides environment-specific parameters like "random action probability", "episode length", "total timesteps", and "safety bound" in Table 1. However, it does not detail specific hyperparameters for the RL algorithms used, such as learning rates, batch sizes, number of epochs, or optimizer settings, which are critical for reproducing the experimental setup of the model's training. |
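To make the quoted methodology concrete: the review above notes that the paper computes an ϵ-upper bound on the reachability probability βM via Interval Iteration (Haddad and Monmege 2018), exploiting trivial end components, and then restricts the agent to a shield of safe actions. The sketch below is a hypothetical illustration of those two steps, not the authors' code; all function and variable names are invented, and terminal safe states are modeled explicitly so that both bounds converge.

```python
# Hypothetical sketch: Interval Iteration for an eps-upper bound on the
# probability beta(s) of ever reaching a bad state, plus a simple shield
# that masks actions whose one-step expected risk exceeds a bound delta.
# Assumes end components are trivial apart from absorbing terminal states,
# as the paper states for its environments.

def interval_iteration(states, actions, P, bad, terminal, eps=1e-6):
    """P[s][a] -> list of (next_state, prob) pairs.
    Returns hi, where hi[s] is an eps-upper bound on beta(s)."""
    lo = {s: 1.0 if s in bad else 0.0 for s in states}
    hi = {s: 0.0 if s in terminal else 1.0 for s in states}
    frozen = bad | terminal  # values fixed: bad -> 1, safe terminal -> 0
    while max(hi[s] - lo[s] for s in states) > eps:
        for bound in (lo, hi):
            new = {s: max(sum(p * bound[t] for t, p in P[s][a])
                          for a in actions[s])
                   for s in states if s not in frozen}
            bound.update(new)
    return hi

def shielded_actions(s, actions, P, beta, delta):
    """Actions whose expected next-state risk stays within delta;
    falls back to the single safest action if none qualifies."""
    risk = {a: sum(p * beta[t] for t, p in P[s][a]) for a in actions[s]}
    allowed = [a for a in actions[s] if risk[a] <= delta]
    return allowed if allowed else [min(risk, key=risk.get)]
```

In this picture, the RL algorithm (PPO in the paper) would sample its actions only from `shielded_actions`, which is one common way a probabilistic shield can enforce a safety bound during training.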