Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning
Authors: Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach. |
| Researcher Affiliation | Academia | Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar & Cody Fleming, Iowa State University, Ames, Iowa, USA, EMAIL |
| Pseudocode | Yes | Algorithm 1: LSPC Training |
| Open Source Code | Yes | The code is available here. |
| Open Datasets | Yes | Our evaluation uses the DSRL benchmark (Liu et al., 2023a), focusing on normalized return and normalized cost to measure performance. |
| Dataset Splits | No | The paper mentions evaluating methods on each dataset with three distinct target cost thresholds and across three random seeds, and in transfer experiments it is "evaluated across 20 episodes". However, it does not provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or explicit standard split references) for the underlying pre-collected datasets. |
| Hardware Specification | Yes | The device used for reporting the training times in this section is a Dell Alienware Aurora R12 system with an 11th gen Intel Core i7 processor, 32 GB DDR4, and an NVIDIA GeForce RTX 3070 8 GB GPU. All experiments were run on a CUDA device. |
| Software Dependencies | No | The paper mentions using "CORL (Tarasov et al., 2024) implementation for Implicit Q-Learning (IQL)" and a "codebase inspired by the OSRL (Liu et al., 2023a) style", along with the "PyBullet physics simulator" and "MuJoCo physics simulator", but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Table 2 (Common Hyperparameters for IQL and AWR): Batch size (\|B\|) = 1024; Discount factor (γ) = 0.99; Soft update rate for Q-networks (τ) = 0.005; Inverse temperature for reward = 2.0; Inverse temperature for cost = 2.0; Learning rate for all parameters = 3 × 10⁻⁴; Asymmetric L2 loss coefficient (ξ) = 0.7; Max exp-advantage weight (both cost and reward) = 200.0 |
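For reference, the shared IQL/AWR hyperparameters reported in Table 2 can be collected into a single config object. This is a minimal sketch: the key names below are illustrative, not the authors' actual variable names, and only the values come from the paper.

```python
# Common IQL/AWR hyperparameters quoted from Table 2 of the paper,
# gathered into a plain dict. Key names are hypothetical.
COMMON_HPARAMS = {
    "batch_size": 1024,           # |B|
    "discount": 0.99,             # gamma
    "soft_update_rate": 0.005,    # tau, target Q-network update rate
    "inv_temp_reward": 2.0,       # inverse temperature for reward
    "inv_temp_cost": 2.0,         # inverse temperature for cost
    "learning_rate": 3e-4,        # shared by all parameters
    "expectile": 0.7,             # asymmetric L2 loss coefficient (xi)
    "max_exp_adv_weight": 200.0,  # cap on exp-advantage weight (cost and reward)
}
```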