Offline Safe Reinforcement Learning Using Trajectory Classification
Authors: Ze Gong, Akshat Kumar, Pradeep Varakantham
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks. |
| Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, EMAIL |
| Pseudocode | No | The paper describes the method conceptually and through mathematical formulations (e.g., Equation 4, 5, 7, 11) but does not present a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not explicitly state that the source code for their method (TraC) is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | For evaluation, we adopt the well-established DSRL benchmark (Liu et al. 2023a), designed specifically for offline safe RL approaches. |
| Dataset Splits | Yes | Given the pre-collected offline dataset D, we create two new subdatasets at the trajectory level: one containing desirable trajectories and the other containing undesirable ones. ... Using the predefined cost threshold l, we first split the dataset into two categories based on the cumulative cost, i.e., safe and unsafe. Within the safe trajectories, we further rank them according to cumulative rewards. The top x% of these safe trajectories are selected as desirable. Moreover, we identify the bottom y% of the safe trajectories, along with all unsafe trajectories as undesirable (x, y are hyperparameters that we show how to set empirically). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper refers to various benchmarks and environments used for evaluation (DSRL benchmark, Safety Gymnasium, Bullet Safety Gym, Meta Drive) but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in their implementation. |
| Experiment Setup | Yes | Each algorithm is tested on each dataset using three distinct cost thresholds and three random seeds to ensure a fair comparison. ... For the practical implementation of TraC, we first pretrain the policy using behavior cloning (BC) with the offline dataset, which we then maintain as the reference policy πref. ... We conducted experiments with various selections of x% and y% to examine how different compositions influence the performance of TraC. ... We tested four different values for each hyperparameter [δ and η], and the results are shown in Table 2. |
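The dataset-split procedure quoted under "Dataset Splits" (threshold on cumulative cost, then rank safe trajectories by cumulative reward, keep the top x% as desirable and the bottom y% plus all unsafe trajectories as undesirable) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function name, trajectory dict layout, and percentage handling are assumptions.

```python
import numpy as np

def split_trajectories(trajectories, cost_limit, x_pct, y_pct):
    """Partition an offline dataset into desirable / undesirable
    trajectory sets, following the procedure described in the paper.

    trajectories: list of dicts, each with 'rewards' and 'costs' arrays
                  (one entry per timestep).
    cost_limit:   the predefined cost threshold l.
    x_pct, y_pct: the top-x% / bottom-y% hyperparameters.
    """
    # Step 1: split by cumulative cost into safe and unsafe trajectories.
    safe, unsafe = [], []
    for traj in trajectories:
        (safe if np.sum(traj["costs"]) <= cost_limit else unsafe).append(traj)

    # Step 2: rank safe trajectories by cumulative reward, descending.
    safe.sort(key=lambda t: np.sum(t["rewards"]), reverse=True)

    n_top = int(len(safe) * x_pct / 100)
    n_bottom = int(len(safe) * y_pct / 100)

    # Top x% of safe trajectories are desirable; the bottom y% of safe
    # trajectories plus every unsafe trajectory are undesirable.
    desirable = safe[:n_top]
    undesirable = safe[len(safe) - n_bottom:] + unsafe
    return desirable, undesirable
```

Note that trajectories in the middle of the safe ranking (neither top x% nor bottom y%) belong to neither sub-dataset, which matches the quoted description.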