Safety Representations for Safer Policy Learning
Authors: Kaustubh Mani, Vincent Mai, Charlie Gauthier, Annie Chen, Samer Nashed, Liam Paull
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety. (Supporting sections: Experiments; Learning Safety Representations Alongside Policy; Ablations & Analysis) |
| Researcher Affiliation | Collaboration | 1 Mila, Université de Montréal, Quebec, Canada; 2 Hydro-Québec Research Center, Quebec, Canada; 3 Stanford University; 4 Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes the SRPL framework and the training process of the S2C model using prose and mathematical formulas (e.g., Equation 3 for the loss function), but it does not include a distinct pseudocode block or algorithm box. |
| Open Source Code | No | Our implementations were based on top of the FSRL (Liu et al., 2024) and Omnisafe (Ji et al., 2024) codebases (https://github.com/PKU-Alignment/omnisafe, https://github.com/liuzuxin/FSRL). The provided links refer to existing external codebases that the authors built upon, rather than explicitly stating the release of their own specific SRPL implementation or providing a direct link to it. |
| Open Datasets | Yes | To demonstrate the usefulness of state-centric safety representations, we perform experiments on Island Navigation (Leike et al., 2017), a grid world environment designed for evaluating safe exploration approaches. First is a manipulation task Adroit Hand Pen (Rajeswaran et al., 2017) where a 24-degree of freedom Shadow Hand agent needs to learn to manipulate a pen from a start orientation to a randomly sampled goal orientation. Next, we have an autonomous driving environment Safe Meta Drive (Li et al., 2022), where an RL agent is learning to drive on the road while avoiding traffic... Finally, we evaluate our method on the Safety Gym (Ray et al., 2019) environment on tasks Point Goal1 and Point Button1. Additionally, we also show results on Mujoco locomotion environments (Ant, Hopper and Walker2d) in the Appendix. |
| Dataset Splits | No | The paper describes the training of RL agents in various environments (Island Navigation, Adroit Hand Pen, Safe Meta Drive, Safety Gym, Mujoco) and evaluation metrics (episodic return, total failures, cost-rate, success rate) over 'training runs across five seeds' and 'over 2M timesteps'. However, it does not specify explicit static train/test/validation dataset splits with percentages or sample counts, which is typical for online reinforcement learning where data is generated through interaction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, memory, or cloud computing resources. |
| Software Dependencies | No | Our implementations were based on top of the FSRL (Liu et al., 2024) and Omnisafe (Ji et al., 2024) codebases. The paper mentions using the FSRL and Omnisafe codebases but does not provide specific version numbers for these frameworks or any other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Implementation details: We model the safety distribution over a fixed safety horizon Hs << H, relying on the assumption that information about near-term safety is more important and that an agent can safely navigate the state space with this information. Instead of modelling the distribution over all time steps in [1, Hs], we split this range into bins to further reduce the dimensionality of the safety representation. An ablation over the choice of bin size and safety horizon Hs is included in Appendix A.6.1. A.3.1 Practical considerations: To address this issue we used a cost limit of 0.1 in our experiments for both the SRPL and vanilla versions of the baseline algorithms. A.3.2 Hyperparameter details: As described earlier, we chose a bin size of 4 and safety horizon Hs = 80 for the Point Goal1 and Point Button1 environments, and a bin size of 4 and safety horizon Hs = 40 for all other environments. The batch size for training the S2C model was chosen between 512 and 5000; we found that a batch size of 5000 led to better performance in the Safety Gym environments and 512 in all other environments. Additionally, we optimized hyperparameters such as the S2C model update frequency (update_freq), which was set to 100 for on-policy baselines and 20000 for off-policy baselines. We used a learning rate of 1e-6 or 1e-5 for on-policy experiments and a learning rate of 1e-3 for off-policy baselines. The S2C model has the same network architecture as the policy, which in most cases is an MLP with two hidden layers of size 64. |
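The binning scheme and per-environment settings quoted above can be sketched as follows. This is a minimal illustration, not the authors' released code (none is provided); the function name `make_bins` and the config dictionaries are hypothetical, while the numeric values (bin size 4, Hs = 80 or 40, S2C batch sizes 5000 and 512) come from Appendix A.3.2 of the paper.

```python
# Hypothetical sketch of the binned safety-horizon setup described in the paper.
# Only the numeric hyperparameters are taken from the text; all names are illustrative.

def make_bins(safety_horizon, bin_size):
    """Split the time-step range [1, Hs] into contiguous bins of width bin_size."""
    assert safety_horizon % bin_size == 0, "horizon must be divisible by bin size"
    return [(start, start + bin_size - 1)
            for start in range(1, safety_horizon + 1, bin_size)]

# Reported settings (Appendix A.3.2):
SAFETY_GYM_CFG = {"safety_horizon": 80, "bin_size": 4,  # Point Goal1 / Point Button1
                  "s2c_batch_size": 5000}
OTHER_ENVS_CFG = {"safety_horizon": 40, "bin_size": 4,
                  "s2c_batch_size": 512}

bins = make_bins(SAFETY_GYM_CFG["safety_horizon"], SAFETY_GYM_CFG["bin_size"])
# 80 / 4 = 20 bins, so the safety representation is 20-dimensional here,
# rather than one entry per time step in [1, 80].
```

The dimensionality reduction is the point of the binning: the S2C model predicts a distribution over 20 (or 10) bins instead of 80 (or 40) individual time steps.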