reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Constrained Reinforcement Learning with Smoothed Log Barrier Function

Authors: Baohe Zhang, Yuan Zhang, Hao Zhu, Shengchao Yan, Thomas Brox, Joschka Boedecker

TMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments on challenging constrained control tasks, we demonstrate that CSAC-LB significantly outperforms baselines by consistently achieving high returns while strictly adhering to safety constraints. Our results establish CSAC-LB as a robust and stable solution for applying RL to safety-critical domains.
Researcher Affiliation	Academia	Baohe Zhang (EMAIL), Yuan Zhang (EMAIL), Hao Zhu (EMAIL), Shengchao Yan (EMAIL), Thomas Brox (EMAIL), Joschka Boedecker (EMAIL); all authors: Department of Computer Science, University of Freiburg
Pseudocode	Yes	Algorithm 1 Constrained Soft Actor-Critic with Log Barriers (CSAC-LB)
Open Source Code	No	The paper does not provide explicit access to source code for the methodology described. It mentions 'Omnisafe' as an infrastructure for accelerating safe reinforcement learning research (Ji et al., 2024) in related work, but this is a third-party project and not the authors' own code release for CSAC-LB.
Open Datasets	Yes	Environment: To evaluate the generalization of our algorithm CSAC-LB and the other baselines, we conduct experiments on 10 tasks introduced by Luo & Ma (2021) and Ji et al. (2023). These high-risk, high-reward tasks cover a wide spectrum, ranging from 2D navigation tasks to continuous control tasks, and from the simple pendulum Tilt task to a high-dimensional humanoid locomotion task with speed limits. All tasks are depicted in Fig. 2. Notably, four tasks are derived from Pendulum and Inverted Pendulum. A detailed introduction can be found in the Appendix. From Appendix A.1 (Safety-Gymnasium Benchmark): These tasks are from the standard Safety-Gymnasium benchmark (Ji et al., 2023), testing navigation and locomotion under explicit safety constraints.
Dataset Splits	No	The paper mentions using tasks from the 'standard Safety-Gymnasium benchmark' and other referenced tasks, but it does not explicitly provide details about the training, validation, or test dataset splits, such as percentages or sample counts. While these benchmarks typically have predefined splits, the paper text itself does not specify them.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU models, CPU types, or other computer specifications used to run the experiments.
Software Dependencies	No	The paper mentions various algorithms and frameworks (e.g., SAC, PPO, DDPG), but it does not specify the version numbers for any software dependencies, such as programming languages, libraries, or solvers.
Experiment Setup	Yes	All algorithms were re-implemented and trained with the same hyperparameters, as detailed in Table 2. Table 2 (Hyperparameter Configuration): Common parameters: Batch Size 256; Network Architecture [256, 256]; Discount Factor (γ) 0.99; Random Steps 100; Learning Rate 1 × 10⁻³; Actor Update Frequency 1; Critic Update Frequency 1; Polyak Update Factor 0.005; Initial Temperature 1.0; Normalize Reward Yes. CSAC-LB parameters: Offset 1.0; Log Barrier Factor 4.0. WCSAC parameter: Damp Scale 10. CPO parameters: GAE Lambda (λ) 0.95; Line Search Max Iterations 15; CG Max Steps 15; Normalize Advantage Yes.
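Because the authors' code is not released, anyone reproducing the experiments has to rebuild the configuration from Table 2. The following is a minimal sketch, under stated assumptions, of how those values could be captured in one place. All class and field names are hypothetical (they do not come from the authors' code). The reading of "Random Steps" as an initial random-action warm-up and of "Initial Temperature" as SAC's entropy temperature are assumptions. The `polyak_update` helper shows the standard soft target-network update that a "Polyak Update Factor" (τ = 0.005) conventionally parameterizes; it operates on plain Python lists rather than real network weights.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CommonParams:
    """Values Table 2 lists as shared by all re-implemented algorithms (names hypothetical)."""
    batch_size: int = 256
    hidden_sizes: Tuple[int, int] = (256, 256)  # "Network Architecture [256, 256]"
    gamma: float = 0.99                          # discount factor
    random_steps: int = 100                      # assumption: random-action warm-up steps
    learning_rate: float = 1e-3
    actor_update_freq: int = 1
    critic_update_freq: int = 1
    polyak_tau: float = 0.005                    # soft target-update coefficient
    initial_temperature: float = 1.0             # assumption: SAC entropy temperature (alpha)
    normalize_reward: bool = True


@dataclass(frozen=True)
class CSACLBParams:
    offset: float = 1.0
    log_barrier_factor: float = 4.0


@dataclass(frozen=True)
class WCSACParams:
    damp_scale: float = 10.0


@dataclass(frozen=True)
class CPOParams:
    gae_lambda: float = 0.95
    line_search_max_iters: int = 15
    cg_max_steps: int = 15
    normalize_advantage: bool = True


def polyak_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Standard soft target update: target <- tau * online + (1 - tau) * target."""
    if len(target) != len(online):
        raise ValueError("target and online parameter lists must have equal length")
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


if __name__ == "__main__":
    common = CommonParams()
    # One soft update of a toy two-parameter "target network" toward the online values.
    print(polyak_update([0.0, 2.0], [1.0, 2.0], common.polyak_tau))
```

Freezing the dataclasses keeps the shared settings immutable, which mirrors the paper's statement that every algorithm was trained with the same common hyperparameters while only the algorithm-specific groups differ.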