Robust Transfer of Safety-Constrained Reinforcement Learning Agents

Authors: Markel Zubia, Thiago Simão, Nils Jansen

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The empirical evaluation shows that this method yields policies that are robust against changes in dynamics, demonstrating safety after transfer to a new environment.
Researcher Affiliation Academia 1. Ruhr University Bochum, Germany; 2. Eindhoven University of Technology, The Netherlands; 3. Radboud University Nijmegen, The Netherlands
Pseudocode No The paper describes the methodology in Section 5 ('ROBUST GUIDED SAFE EXPLORATION') using natural language without presenting any formal pseudocode or algorithm blocks.
Open Source Code Yes The source code is available at https://github.com/ai-fm/safe-and-robust-transfer
Open Datasets Yes We evaluate our method on benchmark environments created using a framework for safe reinforcement learning called Safety-Gymnasium (Ji et al., 2023).
Dataset Splits Yes We restrict the uncertainty set to a finite subset U by discretizing the parameter values to m ∈ {m_1, ..., m_N} and η ∈ {η_1, ..., η_N}. In our experiments, we use N = 8 values for each parameter by letting m_i = (0.5 + (i−1)/7)·m and η_i = (0.5 + (i−1)/7)·η for i = 1, ..., 8, where m and η correspond to the dynamics in the source task.
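The discretization scheme quoted above can be sketched as follows; a minimal illustration, assuming nominal source-task values for m and η (the function name and example nominals are ours, not from the released code):

```python
def discretize(nominal, n=8):
    """Return the n grid values (0.5 + (i-1)/7) * nominal for i = 1..n."""
    return [(0.5 + (i - 1) / 7) * nominal for i in range(1, n + 1)]

# Hypothetical nominal dynamics parameters of the source task.
m_values = discretize(1.0)    # mass grid
eta_values = discretize(0.1)  # damping grid

# The grid spans 0.5x to 1.5x the nominal value in equal steps:
# m_values[0] == 0.5, m_values[-1] == 1.5
```

With N = 8 and step (i−1)/7, the smallest value is 0.5·nominal and the largest is 1.5·nominal, so the uncertainty set is a symmetric ±50% grid around the source-task dynamics.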
Hardware Specification No The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies No The paper mentions 'Safety-Gymnasium' as a framework for benchmark environments but does not provide specific version numbers for it or any other software libraries or dependencies used.
Experiment Setup Yes A HYPERPARAMETERS The hyperparameters in our method are summarized in Table 1. All actor and critic networks are modeled by a multilayer perceptron (MLP).

Parameter               M1          M2          M3
Actor network size      [256, 256]  [256, 256]  [256, 256]
Critic network size     [256, 256]  [256, 256]  [256, 256]
Size of replay buffer   10^6        10^6        10^6
Batch size              256         256         256
Steps per epoch         2000        2000        2000
Number of epochs        10^6        10^6        10^6
Actor learning rate     5×10^-6     5×10^-6     5×10^-6
Critic learning rate    10^-3       10^-3       10^-3
Lambda learning rate    5×10^-7     5×10^-7     5×10^-7
Safety constraint       5           8           25

Table 1: The hyperparameters used in the experiments.
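The Table 1 settings could be expressed as a configuration dict like the sketch below; the key names are illustrative assumptions (not taken from the released code), and only the safety constraint differs across the three environments:

```python
# Hypothetical config mirroring Table 1; key names are ours.
HYPERPARAMS = {
    "M1": {
        "actor_hidden": [256, 256],
        "critic_hidden": [256, 256],
        "replay_buffer_size": int(1e6),
        "batch_size": 256,
        "steps_per_epoch": 2000,
        "num_epochs": int(1e6),
        "actor_lr": 5e-6,
        "critic_lr": 1e-3,
        "lambda_lr": 5e-7,
        "safety_constraint": 5,
    },
}

# M2 and M3 reuse every value except the safety constraint (8 and 25).
for env, limit in (("M2", 8), ("M3", 25)):
    HYPERPARAMS[env] = {**HYPERPARAMS["M1"], "safety_constraint": limit}
```

Factoring the shared values this way makes it explicit that the per-environment difference reported in the paper is the safety budget alone.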