Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Authors: Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach.
Researcher Affiliation | Academia | Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar & Cody Fleming, Iowa State University, Ames, Iowa, USA
Pseudocode | Yes | Algorithm 1: LSPC Training
Open Source Code | Yes | The code is available here.
Open Datasets | Yes | Our evaluation uses the DSRL benchmark (Liu et al., 2023a), focusing on normalized return and normalized cost to measure performance.
Dataset Splits | No | The paper mentions evaluating methods on each dataset with three distinct target cost thresholds and across three random seeds, and in transfer experiments it is "evaluated across 20 episodes". However, it does not provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or explicit standard split references) for the underlying pre-collected datasets.
Hardware Specification | Yes | The device used for reporting the training times in this section is a Dell Alienware Aurora R12 system with an 11th gen Intel Core i7 processor, 32 GB DDR4, and an NVIDIA GeForce RTX 3070 8GB GPU. All experiments were run on a CUDA device.
Software Dependencies | No | The paper mentions using the "CORL (Tarasov et al., 2024) implementation for Implicit Q-Learning (IQL)" and a "codebase inspired by the OSRL (Liu et al., 2023a) style", along with the "PyBullet physics simulator" and "MuJoCo physics simulator", but does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | Table 2: Common Hyperparameters for IQL and AWR — Batch size (|B|): 1024; Discount factor (γ): 0.99; Soft update rate for Q-networks (τ): 0.005; Inverse temperature for reward: 2.0; Inverse temperature for cost: 2.0; Learning rate (all parameters): 3×10⁻⁴; Asymmetric L2 loss coefficient (ξ): 0.7; Max exp advantage weight (both cost and reward): 200.0.
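Two of the Table 2 hyperparameters point at standard components of the IQL/AWR pipeline the paper builds on: the asymmetric L2 loss coefficient (ξ = 0.7) is the expectile used in IQL's value regression, and the inverse temperatures and max weight (2.0, 200.0) parameterize the clipped exponential advantage weight in advantage-weighted regression. A minimal sketch under those usual formulations follows; the function and constant names are ours, not the paper's, and the authors' exact implementation may differ:

```python
import math

XI = 0.7       # asymmetric L2 loss coefficient (expectile), per Table 2
BETA = 2.0     # inverse temperature (same value for reward and cost critics)
W_MAX = 200.0  # max exp advantage weight

def expectile_loss(diff: float, xi: float = XI) -> float:
    """Asymmetric L2 loss: with xi > 0.5, positive residuals are
    penalized more than negative ones, so the regressed value tracks
    an upper expectile of the target distribution."""
    weight = xi if diff > 0 else (1.0 - xi)
    return weight * diff ** 2

def awr_weight(advantage: float, beta: float = BETA, w_max: float = W_MAX) -> float:
    """Clipped exponential advantage weight for advantage-weighted
    regression; the clip at w_max bounds the influence of any
    single high-advantage sample."""
    return min(math.exp(beta * advantage), w_max)
```

For example, a residual of +1 costs 0.7 while -1 costs only 0.3, and a large advantage saturates at the 200.0 cap rather than dominating the policy-extraction loss.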