Constrained Reinforcement Learning with Smoothed Log Barrier Function
Authors: Baohe Zhang, Yuan Zhang, Hao Zhu, Shengchao Yan, Thomas Brox, Joschka Boedecker
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on challenging constrained control tasks, we demonstrate that CSAC-LB significantly outperforms baselines by consistently achieving high returns while strictly adhering to safety constraints. Our results establish CSAC-LB as a robust and stable solution for applying RL to safety-critical domains. |
| Researcher Affiliation | Academia | Baohe Zhang, Yuan Zhang, Hao Zhu, Shengchao Yan, Thomas Brox, and Joschka Boedecker: all six authors are affiliated with the Department of Computer Science, University of Freiburg. |
| Pseudocode | Yes | Algorithm 1 Constrained Soft Actor-Critic with Log Barriers (CSAC-LB) |
| Open Source Code | No | The paper does not provide explicit access to source code for the described method. Its related-work section mentions OmniSafe, an infrastructure for accelerating safe reinforcement learning research (Ji et al., 2024), but this is a third-party project and not the authors' own code release for CSAC-LB. |
| Open Datasets | Yes | Environment: To evaluate the generalization of our algorithm CSAC-LB and the other baselines, we conduct experiments on 10 tasks introduced by Luo & Ma (2021) and Ji et al. (2023). These high-risk, high-reward tasks cover a wide spectrum, ranging from 2D navigation to continuous control and from the simple pendulum Tilt task to high-dimensional humanoid locomotion with speed limits. All tasks are depicted in Fig. 2. Notably, four of the tasks are derived from Pendulum and Inverted Pendulum. A detailed introduction can be found in the Appendix. From Appendix A.1 (Safety-Gymnasium Benchmark): these tasks come from the standard Safety-Gymnasium benchmark (Ji et al., 2023) and test navigation and locomotion under explicit safety constraints. |
| Dataset Splits | No | The paper uses tasks from the standard Safety-Gymnasium benchmark and other referenced tasks, but it does not report training, validation, or test splits (e.g., percentages or sample counts). Because these are simulation environments in which the agent generates its own data through interaction, fixed dataset splits in the supervised-learning sense do not apply, and the paper text does not specify any. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other computer specifications used to run the experiments. |
| Software Dependencies | No | The paper mentions various algorithms (e.g., SAC, PPO, DDPG), but it does not specify version numbers for any software dependencies, such as programming languages, libraries, or solvers. |
| Experiment Setup | Yes | All algorithms were re-implemented and trained with the same hyperparameters, as detailed in Table 2. Table 2 (Hyperparameter Configuration). Common parameters: Batch Size 256; Network Architecture [256, 256]; Discount Factor (γ) 0.99; Random Steps 100; Learning Rate 1 × 10⁻³; Actor Update Frequency 1; Critic Update Frequency 1; Polyak Update Factor 0.005; Initial Temperature 1.0; Normalize Reward: Yes. CSAC-LB parameters: Offset 1.0; Log Barrier Factor 4.0. WCSAC parameter: Damp Scale 10. CPO parameters: GAE Lambda (λ) 0.95; Line Search Max Iterations 15; CG Max Steps 15; Normalize Advantage: Yes. |
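The smoothed log barrier behind Algorithm 1 is not spelled out in this summary. As a hedged illustration, the sketch below implements one common smoothing: a log barrier for a constraint `z <= 0` that switches to a linear extension near the boundary, so the penalty stays finite and differentiable even when the constraint is violated. It assumes that the "Log Barrier Factor" of 4.0 from Table 2 plays the role of the sharpness parameter `t`. The "Offset" of 1.0 is left out because its role is not described here. The names `smoothed_log_barrier` and `barrier_penalty`, and the constraint form `cost_q - cost_limit <= 0`, are hypothetical and are not the authors' code.

```python
import math


def smoothed_log_barrier(z: float, t: float = 4.0) -> float:
    """Linearly extended log barrier for a constraint z <= 0 (assumed form).

    For z <= -1/t**2 it is the plain barrier -(1/t) * log(-z). Beyond that
    point it continues as a straight line with the same value and slope (t),
    so it is defined for any z, including violated constraints (z > 0).
    """
    if t <= 0:
        raise ValueError("t must be positive")
    switch = -1.0 / t**2
    if z <= switch:
        return -math.log(-z) / t
    # Linear branch: value and first derivative match the barrier at `switch`.
    return t * z - math.log(1.0 / t**2) / t + 1.0 / t


def barrier_penalty(cost_q: float, cost_limit: float, t: float = 4.0) -> float:
    """Hypothetical use: penalize a cost-critic estimate approaching a limit."""
    return smoothed_log_barrier(cost_q - cost_limit, t)
```

With `t = 4`, the switch point is `z = -1/16`. There both branches equal `log 2` and have slope 4. Larger `t` makes the penalty steeper near the boundary and closer to a hard constraint.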
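To re-run the comparison, Table 2 can be captured as a typed configuration. The sketch below uses a hypothetical layout: the class and field names are illustrative, and only the values come from Table 2. The learning rate is read as 1 × 10⁻³, and "Random Steps" is assumed to mean initial random-action steps. The `polyak_update` helper shows the standard soft target update that the Polyak update factor (τ = 0.005) controls in SAC-style methods. It operates on plain lists of floats rather than network parameters.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CommonConfig:
    """Shared by all algorithms (Table 2, common parameters)."""
    batch_size: int = 256
    hidden_sizes: Tuple[int, ...] = (256, 256)  # network architecture
    gamma: float = 0.99                         # discount factor
    random_steps: int = 100                     # assumed: initial random-action steps
    learning_rate: float = 1e-3
    actor_update_freq: int = 1
    critic_update_freq: int = 1
    polyak_tau: float = 0.005                   # Polyak update factor
    init_temperature: float = 1.0               # initial entropy temperature
    normalize_reward: bool = True


@dataclass(frozen=True)
class CSACLBConfig:
    offset: float = 1.0
    log_barrier_factor: float = 4.0


@dataclass(frozen=True)
class WCSACConfig:
    damp_scale: float = 10.0


@dataclass(frozen=True)
class CPOConfig:
    gae_lambda: float = 0.95
    line_search_max_iters: int = 15
    cg_max_steps: int = 15
    normalize_advantage: bool = True


def polyak_update(target: Sequence[float], online: Sequence[float],
                  tau: float) -> List[float]:
    """Soft update, element-wise: target <- tau * online + (1 - tau) * target."""
    if len(target) != len(online):
        raise ValueError("target and online must have the same length")
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]
```

Example: `polyak_update([0.0], [1.0], CommonConfig().polyak_tau)` moves the target 0.5% of the way toward the online value. Freezing the dataclasses keeps a run's hyperparameters from being mutated accidentally mid-experiment.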