Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking

Authors: Hanna Krasowski, Jakob Thumm, Marlon Müller, Lukas Schäfer, Xiao Wang, Matthias Althoff

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Therefore, we introduce a categorization of existing provably safe RL methods, present the conceptual foundations for both continuous and discrete action spaces, and empirically benchmark existing methods. Our experiments on an inverted pendulum and a quadrotor stabilization task indicate that action replacement is the best-performing approach for these applications despite its comparatively simple realization. Furthermore, adding a reward penalty, every time the safety verification is engaged, improved training performance in our experiments.
Researcher Affiliation | Academia | Hanna Krasowski EMAIL, Jakob Thumm EMAIL, Marlon Müller EMAIL, Lukas Schäfer EMAIL, Xiao Wang EMAIL, Matthias Althoff EMAIL, School of Computation, Information and Technology, Technical University of Munich
Pseudocode | No | The paper provides mathematical formulations and descriptions of algorithms but does not include any clearly labeled pseudocode or algorithm blocks. For example, Sections 2.1, 2.2, and 2.3 describe methods with equations, but without structured algorithmic steps.
Open Source Code | Yes | Our implementation is available at Code Ocean: doi.org/10.24433/CO.9209121.v1. All implementations are based on stable-baselines3 (Raffin et al., 2021). Additionally, the code for the experiments is available at the Code Ocean capsule doi.org/10.24433/CO.9209121.v1 to reproduce our results.
Open Datasets | Yes | Inverted pendulum: The observation and reward are identical to the OpenAI Gym Pendulum-V0 environment (available at gymnasium.farama.org/environments/classic_control/pendulum/). 2D quadrotor: The quadrotor in our experiments... are based on Mitchell et al. (2019).
Dataset Splits | No | The paper mentions using "ten random seeds" for training runs and "30 pendulum deployment episodes" / "30 2D quadrotor deployment episodes" for evaluation. However, it does not specify explicit training/test/validation dataset splits with percentages or sample counts for a predefined dataset. The environments are simulations, not fixed datasets with pre-defined splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, memory specifications, or cloud computing instance types.
Software Dependencies | Yes | Our implementation is available at Code Ocean: doi.org/10.24433/CO.9209121.v1. All implementations are based on stable-baselines3 (Raffin et al., 2021). We specify the hyperparameters for all learning algorithms (see Table 5 for PPO, Table 6 for TD3, Table 7 for DQN, and Table 8 for SAC) that are different from the Stable Baselines3 (Raffin et al., 2021) default values.
Experiment Setup | Yes | Hyperparameters for learning algorithms: We specify the hyperparameters for all learning algorithms (see Table 5 for PPO, Table 6 for TD3, Table 7 for DQN, and Table 8 for SAC) that are different from the Stable Baselines3 (Raffin et al., 2021) default values. Additionally, the code for the experiments is available at the Code Ocean capsule doi.org/10.24433/CO.9209121.v1 to reproduce our results.
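The abstract quoted above describes "action replacement" with a reward penalty whenever the safety verification engages. The following is a minimal, illustrative sketch of that pattern only, using toy one-dimensional dynamics and a hypothetical verifier; it is not the paper's implementation, its benchmark environments, or its verification method.

```python
# Sketch of action replacement: before executing an agent's proposed action,
# a safety verifier checks it; if unsafe, the action is replaced by a verified
# fallback and a reward penalty is applied. All functions here are toy
# placeholders (hypothetical verifier, fallback, dynamics, and penalty value).

def is_safe(state: float, action: float) -> bool:
    """Hypothetical verifier: the next state must stay within [-1, 1]."""
    return abs(state + action) <= 1.0

def safe_fallback_action(state: float) -> float:
    """Hypothetical verified fallback: steer the state toward zero."""
    return -state

def safe_step(state: float, action: float, penalty: float = -1.0):
    """One environment step with action replacement and a shield penalty."""
    intervened = not is_safe(state, action)
    if intervened:
        action = safe_fallback_action(state)   # replace the unsafe action
    next_state = state + action                # toy dynamics
    reward = -abs(next_state)                  # toy reward: stay near zero
    if intervened:
        reward += penalty                      # penalize engaging the shield
    return next_state, reward, intervened

# Unsafe proposal (0.9 + 0.5 leaves [-1, 1]) is replaced by the fallback:
state, reward, intervened = safe_step(0.9, 0.5)
# → state = 0.0, reward = -1.0, intervened = True
```

The reward penalty gives the learner a gradient away from actions that trigger the shield, which is the training effect the quoted abstract reports as beneficial.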