Offline Safe Reinforcement Learning Using Trajectory Classification

Authors: Ze Gong, Akshat Kumar, Pradeep Varakantham

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.
Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University
Pseudocode | No | The paper describes the method conceptually and through mathematical formulations (e.g., Equations 4, 5, 7, and 11) but does not present a dedicated pseudocode or algorithm block.
Open Source Code | No | The paper does not explicitly state that the source code for their method (TraC) is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | For evaluation, we adopt the well-established DSRL benchmark (Liu et al. 2023a), designed specifically for offline safe RL approaches.
Dataset Splits | Yes | Given the pre-collected offline dataset D, we create two new subdatasets at the trajectory level: one containing desirable trajectories and the other containing undesirable ones. ... Using the predefined cost threshold l, we first split the dataset into two categories based on the cumulative cost, i.e., safe and unsafe. Within the safe trajectories, we further rank them according to cumulative rewards. The top x% of these safe trajectories are selected as desirable. Moreover, we identify the bottom y% of the safe trajectories, along with all unsafe trajectories as undesirable (x, y are hyperparameters that we show how to set empirically).
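The quoted split procedure can be sketched in a few lines. This is a minimal illustration, not the authors' code; the trajectory schema (dicts with precomputed `total_cost` and `total_reward` fields) is a hypothetical assumption.

```python
def split_trajectories(dataset, cost_threshold, x_pct, y_pct):
    """Partition trajectories into desirable / undesirable subsets,
    per the split described in the paper's quoted passage.

    dataset: list of trajectories, each a dict with precomputed
    'total_cost' and 'total_reward' (hypothetical schema).
    """
    # Step 1: split by cumulative cost against the threshold l.
    safe = [t for t in dataset if t["total_cost"] <= cost_threshold]
    unsafe = [t for t in dataset if t["total_cost"] > cost_threshold]

    # Step 2: rank safe trajectories by cumulative reward, highest first.
    safe.sort(key=lambda t: t["total_reward"], reverse=True)

    n_top = int(len(safe) * x_pct / 100)
    n_bottom = int(len(safe) * y_pct / 100)

    # Top x% of safe trajectories are desirable; the bottom y% of
    # safe trajectories plus all unsafe trajectories are undesirable.
    desirable = safe[:n_top]
    undesirable = safe[len(safe) - n_bottom:] + unsafe
    return desirable, undesirable
```

Note that x and y are the hyperparameters the paper sets empirically; the ranking means the desirable set maximizes reward among constraint-satisfying trajectories.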
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper refers to various benchmarks and environments used for evaluation (DSRL benchmark, Safety Gymnasium, Bullet Safety Gym, MetaDrive) but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in its implementation.
Experiment Setup | Yes | Each algorithm is tested on each dataset using three distinct cost thresholds and three random seeds to ensure a fair comparison. ... For the practical implementation of TraC, we first pretrain the policy using behavior cloning (BC) with the offline dataset, which we then maintain as the reference policy πref. ... We conducted experiments with various selections of x% and y% to examine how different compositions influence the performance of TraC. ... We tested four different values for each hyperparameter [δ and η], and the results are shown in Table 2.
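The behavior-cloning pretraining step mentioned above can be illustrated with a deliberately tiny sketch: a linear policy fit to (state, action) pairs by gradient descent on squared error, then frozen as the reference policy πref. The linear parameterization, learning rate, and data schema are illustrative assumptions, not the paper's actual setup (which would use a neural policy).

```python
def behavior_cloning_pretrain(pairs, n_features, lr=0.05, epochs=200):
    """Toy BC sketch: fit a linear policy a ≈ w·s to offline
    (state, action) pairs by per-sample gradient descent.

    pairs: list of (state: list[float], action: float) tuples
    (hypothetical schema). Returns the learned weights, which the
    paper's method would then freeze as the reference policy.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for state, action in pairs:
            pred = sum(wi * si for wi, si in zip(w, state))
            err = pred - action  # squared-error gradient factor
            for i in range(n_features):
                w[i] -= lr * err * state[i]
    return w  # kept fixed afterwards as pi_ref
```

The key design point from the quote is that πref is trained once on the offline dataset and then held fixed while the main policy is optimized against it.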