ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Authors: Yarden As, Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Stelian Coros, Andreas Krause

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically show that ACTSAFE obtains state-of-the-art performance in difficult exploration tasks on standard safe deep RL benchmarks while ensuring safety during learning.
Researcher Affiliation Academia Yarden As (ETH Zürich), Bhavya Sukhija (ETH Zürich), Lenart Treven (ETH Zürich), Carmelo Sferrazza (UC Berkeley), Stelian Coros (ETH Zürich), Andreas Krause (ETH Zürich)
Pseudocode Yes Algorithm 1 ACTSAFE: ACTIVE EXPLORATION WITH SAFETY CONSTRAINTS (Expansion stage)
Init: aleatoric uncertainty σ, probability δ, statistical model (µ0, σ0, β0(δ))
for episode n = 1, . . . , N do
    π_n = argmax_{π ∈ S_n} max_{f ∈ M_n} E_{τ^{π,f}} [ Σ_{t=0}^{T−1} σ_{n−1}(ŝ_t, π(ŝ_t)) ]    ▷ Prepare policy
    D_n ← ROLLOUT(π_n)    ▷ Collect data
    Update (M_n, S_n) with D_{1:n}    ▷ Update statistical model and safe set
end for
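The expansion stage above can be illustrated with a minimal, self-contained Python sketch. All names here are illustrative stand-ins, not the paper's implementation: the statistical model M_n is reduced to a dict of state visit counts, the epistemic uncertainty σ_{n−1} shrinks with visits, and the safe set S_n is a fixed list of allowed policies.

```python
# Toy sketch of ActSafe's expansion stage (Algorithm 1), under the
# simplifying assumptions described above. Each episode, the agent
# picks the safe policy whose imagined trajectory has the highest
# summed model uncertainty, rolls it out, and updates the model.

def uncertainty(model, state):
    # Stand-in for sigma_{n-1}(s): shrinks as a state is visited more.
    return 1.0 / (1.0 + model.get(state, 0))

def expansion_stage(policies, num_episodes, horizon):
    model = {}  # statistical model M_n, here just visit counts
    for n in range(num_episodes):
        # pi_n = argmax_{pi in S_n} sum_t sigma_{n-1}(s_t, pi(s_t))
        policy = max(
            policies,
            key=lambda pi: sum(uncertainty(model, pi(t)) for t in range(horizon)),
        )
        # D_n <- ROLLOUT(pi_n): visit the states the policy reaches,
        # then update M_n with the collected data D_{1:n}.
        for t in range(horizon):
            state = policy(t)
            model[state] = model.get(state, 0) + 1
    return model

# Two toy policies that cover disjoint state regions; greedy
# uncertainty maximization alternates between them.
policies = [lambda t: t, lambda t: t + 10]
model = expansion_stage(policies, num_episodes=4, horizon=5)
```

The `max(..., key=...)` line mirrors the argmax in the pseudocode: exploration is driven purely by the intrinsic uncertainty signal, with safety handled by restricting the search to the safe set (here, the fixed `policies` list).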
Open Source Code Yes We provide an open-source implementation of our experiments in https://github.com/yardenas/actsafe.
Open Datasets Yes Additionally, we show that ACTSAFE scales to high-dimensional environments of the SAFETY-GYM and RWRL benchmarks, excelling in challenging exploration tasks with visual control while also incurring significantly fewer constraint violations.
Dataset Splits No The paper does not explicitly provide training/test/validation dataset splits. Instead, it describes a warm-up period for data collection in an RL setting and mentions sampling episodes for evaluation, which is characteristic of online reinforcement learning rather than predefined dataset splits.
Hardware Specification No The paper does not provide specific hardware details such as exact GPU or CPU models used for running its experiments.
Software Dependencies No The paper mentions several frameworks and tools used, such as Dreamer (Hafner et al., 2023), the Recurrent State Space Model (RSSM) from Hafner et al. (2019), Log-Barrier SGD (LBSGD; Usmanova et al., 2024), and iCEM (Pinneri et al., 2021). However, it does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or other core libraries.
Experiment Setup Yes For the state-based tasks, we use GPs to model the dynamics f. For the visual control tasks, we use the RSSM model from Hafner et al. (2019) as described in Section 4.3. We thus validate both the theoretical and practical aspects of ACTSAFE in this section. For both environments, we run the algorithms for ten episodes and then use the learned model to plan w.r.t. known extrinsic rewards after the expansion phase. We assume access to an initial data collection (warm-up) period of 200K environment steps, where the agent collects data and uses it to calibrate its world model. We use the same training procedure across all baselines and environments. We set the cost budget for each episode to d = 25 for SAFETY-GYM. Unless specified otherwise, in all our experiments we use 5 random seeds and report the median and standard error across these seeds. Finally, we use a budget of 5M training steps for each training run. For CARTPOLESWINGUPSPARSE, we use a cost budget of d = 100 and an episode length of T = 1000 steps. In our experiments we treat λ as a hyperparameter.
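The reported settings can be collected into a single configuration sketch. The field names below are my own labels, not the paper's; only the values are taken from the excerpt above.

```python
# Illustrative summary of the reported experiment settings as a plain
# config dict. Key names are hypothetical; values follow the paper.
EXPERIMENT_CONFIG = {
    "warmup_env_steps": 200_000,      # initial data-collection (warm-up) period
    "training_steps": 5_000_000,      # budget per training run
    "random_seeds": 5,                # median and standard error reported
    "expansion_episodes": 10,         # episodes before planning with extrinsic rewards
    "cost_budget": {
        "safety_gym": 25,             # per-episode cost budget d
        "cartpole_swingup_sparse": 100,
    },
    "episode_length": {
        "cartpole_swingup_sparse": 1000,  # T steps
    },
}
```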