Offline Safe Reinforcement Learning Using Trajectory Classification
Authors: Ze Gong, Akshat Kumar, Pradeep Varakantham
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks. |
| Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University, EMAIL |
| Pseudocode | No | The paper describes the method conceptually and through mathematical formulations (e.g., Equation 4, 5, 7, 11) but does not present a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not explicitly state that the source code for their method (TraC) is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | For evaluation, we adopt the well-established DSRL benchmark (Liu et al. 2023a), designed specifically for offline safe RL approaches. |
| Dataset Splits | Yes | Given the pre-collected offline dataset D, we create two new subdatasets at the trajectory level: one containing desirable trajectories and the other containing undesirable ones. ... Using the predefined cost threshold l, we first split the dataset into two categories based on the cumulative cost, i.e., safe and unsafe. Within the safe trajectories, we further rank them according to cumulative rewards. The top x% of these safe trajectories are selected as desirable. Moreover, we identify the bottom y% of the safe trajectories, along with all unsafe trajectories as undesirable (x, y are hyperparameters that we show how to set empirically). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper refers to various benchmarks and environments used for evaluation (DSRL benchmark, Safety Gymnasium, Bullet Safety Gym, Meta Drive) but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in their implementation. |
| Experiment Setup | Yes | Each algorithm is tested on each dataset using three distinct cost thresholds and three random seeds to ensure a fair comparison. ... For the practical implementation of TraC, we first pretrain the policy using behavior cloning (BC) with the offline dataset, which we then maintain as the reference policy πref. ... We conducted experiments with various selections of x% and y% to examine how different compositions influence the performance of TraC. ... We tested four different values for each hyperparameter [δ and η], and the results are shown in Table 2. |
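The dataset-split procedure quoted under "Dataset Splits" (threshold on cumulative cost, then rank safe trajectories by cumulative reward, keep the top x% as desirable and the bottom y% plus all unsafe trajectories as undesirable) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function name, trajectory dict layout, and percentage handling are assumptions.

```python
import numpy as np

def split_trajectories(trajectories, cost_limit, x_pct, y_pct):
    """Partition an offline dataset into desirable / undesirable
    trajectory sets, following the procedure described in the paper.

    trajectories: list of dicts, each with 'rewards' and 'costs' arrays
                  (one entry per timestep).
    cost_limit:   the predefined cost threshold l.
    x_pct, y_pct: the top-x% / bottom-y% hyperparameters.
    """
    # Step 1: split by cumulative cost into safe and unsafe trajectories.
    safe, unsafe = [], []
    for traj in trajectories:
        (safe if np.sum(traj["costs"]) <= cost_limit else unsafe).append(traj)

    # Step 2: rank safe trajectories by cumulative reward, descending.
    safe.sort(key=lambda t: np.sum(t["rewards"]), reverse=True)

    n_top = int(len(safe) * x_pct / 100)
    n_bottom = int(len(safe) * y_pct / 100)

    # Top x% of safe trajectories are desirable; the bottom y% of safe
    # trajectories plus every unsafe trajectory are undesirable.
    desirable = safe[:n_top]
    undesirable = safe[len(safe) - n_bottom:] + unsafe
    return desirable, undesirable
```

Note that trajectories in the middle of the safe ranking (neither top x% nor bottom y%) belong to neither sub-dataset, which matches the quoted description.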