Constraint-Conditioned Actor-Critic for Offline Safe Reinforcement Learning

Authors: Zijian Guo, Weichao Zhou, Shengao Wang, Wenchao Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on the DSRL benchmarks show that CCAC significantly outperforms existing methods for learning adaptive, safe, and high-reward policies. The paper includes a dedicated section '5 EXPERIMENTS' and various performance tables and figures, such as 'Table 1: Evaluation results of the normalized reward and cost.', 'Figure 2: Evaluation results of reward and cost in Run and Circle tasks with different percentages of datasets being used for training.', and ablation studies in 'Figure 5: Ablation study: average performance of CCAC and its variants in Run and Circle tasks.' and 'Figure 6: Ablation study: Qc-values plots.'.
Researcher Affiliation | Academia | All authors are affiliated with Boston University, as indicated by '1Division of Systems Engineering, Boston University 2Department of Electrical and Computer Engineering, Boston University', and their email addresses use the '@bu.edu' domain, which is characteristic of an academic institution.
Pseudocode | Yes | The paper states, 'The overall method is summarized in Algorithm 1 in Appendix C.1.' Appendix C.1 contains 'Algorithm 1 Cost-Conditioned Actor-Critic (CCAC)', which provides a structured pseudocode block for the proposed method.
Open Source Code | Yes | The abstract explicitly states, 'The code is available at https://github.com/BU-DEPENDLab/CCAC.'
Open Datasets | Yes | The paper mentions using public benchmarks: 'Tasks. The Bullet-Safety-Gym (Gronauer, 2022) and Safety-Gymnasium (Ji et al., 2023) are public benchmarks... and DSRL (Liu et al., 2023a), a comprehensive benchmark specialized for offline safe RL, provides the offline datasets.' Additionally, the Reproducibility Statement confirms, 'The datasets used are provided from a publicly available benchmark that uses simulated dynamical control environments...'
Dataset Splits | Yes | The paper describes specific dataset manipulations for its experiments: 'To assess the effect of OOD states and actions, we use different percentages of data to train policies and then evaluate their performance.' Table 4 clarifies that 'p = 1.0/0.75/0.5/0.25 means 100%, 75%, 50%, and 25% of the offline data is used during training respectively.' The paper also applies a 'data density filter' and a 'partial data filter' to create modified datasets, as shown in Figure 8 and described in Appendix B.2.
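The p = 1.0/0.75/0.5/0.25 training splits described above amount to randomly retaining a fraction of the offline transitions. A minimal sketch of such subsampling, assuming the dataset is a dict of equal-length arrays (the actual DSRL data format and the authors' sampling procedure may differ):

```python
import numpy as np

def subsample_offline_dataset(transitions, p, seed=0):
    """Keep a random fraction p of offline transitions.

    `transitions` is assumed to be a dict mapping field names
    (e.g. 'observations', 'actions', 'rewards', 'costs') to
    arrays of equal length; the same indices are kept for every
    field so transitions stay aligned.
    """
    n = len(next(iter(transitions.values())))
    rng = np.random.default_rng(seed)
    keep = rng.choice(n, size=int(n * p), replace=False)
    return {k: np.asarray(v)[keep] for k, v in transitions.items()}
```

For example, `subsample_offline_dataset(data, 0.25)` would mimic the paper's p = 0.25 setting by training on a quarter of the offline data.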
Hardware Specification | No | No specific hardware details (such as CPU or GPU models, or memory specifications) used for running the experiments are mentioned in the paper. The Reproducibility Statement only mentions 'simulated dynamical control environments' and does not specify the computing hardware used.
Software Dependencies | No | The paper refers to using existing implementations and frameworks for baselines, e.g., 'we use the OSRL1 implementation' and 'We adopt the CQL-Saute from this CQL implementation2'. However, it does not provide specific version numbers for software components such as Python, PyTorch, or other libraries essential for replication.
Experiment Setup | Yes | Appendix C.2, titled 'HYPERPARAMETERS', provides a detailed Table 3 listing specific values for various parameters used in the experiments, including 'Actor hidden size [256, 256]', 'Critic hidden size [256, 256]', 'VAE/CVAE hidden size [512, 512, 64, 512, 512]', 'Episode length', 'Batch size', 'Training steps', 'γ 0.99', 'Actor learning rate 1e-4', 'Critic learning rate 1e-3', 'VAE/CVAE learning rate 1e-3', and 'Critic ensemble 4'. It also mentions PID parameters for certain baselines.
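The concrete values quoted from Table 3 can be collected into a single config, which is what a replication would start from. A sketch, using illustrative key names (not the authors' exact config keys) and only the values the review quotes above:

```python
# Hyperparameters as reported in the paper's Appendix C.2, Table 3.
# Key names are hypothetical; values marked "per task" in the paper
# (episode length, batch size, training steps) are omitted here.
CCAC_HPARAMS = {
    "actor_hidden_sizes": [256, 256],
    "critic_hidden_sizes": [256, 256],
    "vae_hidden_sizes": [512, 512, 64, 512, 512],
    "gamma": 0.99,              # discount factor γ
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "vae_lr": 1e-3,             # VAE/CVAE learning rate
    "critic_ensemble_size": 4,
}
```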