Safety Representations for Safer Policy Learning
Authors: Kaustubh Mani, Vincent Mai, Charlie Gauthier, Annie Chen, Samer Nashed, Liam Paull
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety. (Supporting sections: Experiments; Learning Safety Representations Alongside Policy; Ablations & Analysis) |
| Researcher Affiliation | Collaboration | 1 Mila, Université de Montréal, Quebec, Canada; 2 Hydro-Québec Research Center, Quebec, Canada; 3 Stanford University; 4 Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes the SRPL framework and the training process of the S2C model using prose and mathematical formulas (e.g., Equation 3 for the loss function), but it does not include a distinct pseudocode block or algorithm box. |
| Open Source Code | No | Our implementations were based on top of the FSRL (Liu et al., 2024) and Omnisafe (Ji et al., 2024) codebases (https://github.com/PKU-Alignment/omnisafe, https://github.com/liuzuxin/FSRL). The provided links refer to existing external codebases that the authors built upon, rather than explicitly stating the release of their own specific SRPL implementation or providing a direct link to it. |
| Open Datasets | Yes | To demonstrate the usefulness of state-centric safety representations, we perform experiments on Island Navigation (Leike et al., 2017), a grid world environment designed for evaluating safe exploration approaches. First is a manipulation task Adroit Hand Pen (Rajeswaran et al., 2017) where a 24-degree of freedom Shadow Hand agent needs to learn to manipulate a pen from a start orientation to a randomly sampled goal orientation. Next, we have an autonomous driving environment Safe Meta Drive (Li et al., 2022), where an RL agent is learning to drive on the road while avoiding traffic... Finally, we evaluate our method on the Safety Gym (Ray et al., 2019) environment on tasks Point Goal1 and Point Button1. Additionally, we also show results on Mujoco locomotion environments (Ant, Hopper and Walker2d) in the Appendix. |
| Dataset Splits | No | The paper describes the training of RL agents in various environments (Island Navigation, Adroit Hand Pen, Safe Meta Drive, Safety Gym, Mujoco) and evaluation metrics (episodic return, total failures, cost-rate, success rate) over 'training runs across five seeds' and 'over 2M timesteps'. However, it does not specify explicit static train/test/validation dataset splits with percentages or sample counts, which is typical for online reinforcement learning where data is generated through interaction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, memory, or cloud computing resources. |
| Software Dependencies | No | Our implementations were based on top of the FSRL (Liu et al., 2024) and Omnisafe (Ji et al., 2024) codebases. The paper mentions using the FSRL and Omnisafe codebases but does not provide specific version numbers for these frameworks or any other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Implementation details: We model the safety distribution over a fixed safety horizon Hs << H, relying on the assumption that information about near-term safety is more important and that an agent can safely navigate the state space with this information. Instead of modelling the distribution over all time steps in [1, Hs], we split this range into bins to further reduce the dimensionality of the safety representation. An ablation over the choice of bin size and safety horizon Hs is included in Appendix A.6.1. A.3.1 Practical considerations: To address this issue we used a cost limit of 0.1 in our experiments for both the SRPL and vanilla versions of the baseline algorithms. A.3.2 Hyperparameter details: As described earlier, we chose a bin size of 4 and safety horizon Hs = 80 for the Point Goal1 and Point Button1 environments, and a bin size of 4 and safety horizon Hs = 40 for all other environments. The batch size for training the S2C model was chosen between 512 and 5000; we found that a batch size of 5000 led to better performance in the Safety Gym environments and 512 in all other environments. Additionally, we optimized hyperparameters such as the S2C model update frequency (update_freq), which was set to 100 for on-policy baselines and 20000 for off-policy baselines. We used a learning rate of 1e-6 or 1e-5 for on-policy experiments and a learning rate of 1e-3 for off-policy baselines. The S2C model has the same network architecture as the policy, which in most cases is an MLP with two hidden layers of size 64. |
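The binning scheme and per-environment settings quoted above can be sketched as follows. This is a minimal illustration, not the authors' released code (none is provided); the function name `make_bins` and the config dictionaries are hypothetical, while the numeric values (bin size 4, Hs = 80 or 40, S2C batch sizes 5000 and 512) come from Appendix A.3.2 of the paper.

```python
# Hypothetical sketch of the binned safety-horizon setup described in the paper.
# Only the numeric hyperparameters are taken from the text; all names are illustrative.

def make_bins(safety_horizon, bin_size):
    """Split the time-step range [1, Hs] into contiguous bins of width bin_size."""
    assert safety_horizon % bin_size == 0, "horizon must be divisible by bin size"
    return [(start, start + bin_size - 1)
            for start in range(1, safety_horizon + 1, bin_size)]

# Reported settings (Appendix A.3.2):
SAFETY_GYM_CFG = {"safety_horizon": 80, "bin_size": 4,  # Point Goal1 / Point Button1
                  "s2c_batch_size": 5000}
OTHER_ENVS_CFG = {"safety_horizon": 40, "bin_size": 4,
                  "s2c_batch_size": 512}

bins = make_bins(SAFETY_GYM_CFG["safety_horizon"], SAFETY_GYM_CFG["bin_size"])
# 80 / 4 = 20 bins, so the safety representation is 20-dimensional here,
# rather than one entry per time step in [1, 80].
```

The dimensionality reduction is the point of the binning: the S2C model predicts a distribution over 20 (or 10) bins instead of 80 (or 40) individual time steps.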