State-wise Constrained Policy Optimization

Authors: Weiye Zhao, Rui Chen, Yifan Sun, Feihan Li, Tianhao Wei, Changliu Liu

TMLR 2024

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental
Evidence: "We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks."
Researcher Affiliation: Academia
Evidence: Weiye Zhao, Rui Chen, Yifan Sun, Feihan Li, Tianhao Wei, and Changliu Liu are all affiliated with the Robotics Institute, Carnegie Mellon University.
Pseudocode: Yes
Evidence: "Algorithm 1: State-wise Constrained Policy Optimization"
Open Source Code: Yes
Evidence: "Our code is available on Github." (footnote 1: https://github.com/intelligent-control-lab/State_Wise_Constrained_Policy_Optimization)
Open Datasets: No
Evidence: "New Safety Gym. To showcase the effectiveness of our state-wise constrained policy optimization approach, we enhance the widely recognized safe reinforcement learning benchmark environment, Safety Gym (Ray et al., 2019), by incorporating additional robots and constraints. Subsequently, we perform a series of experiments on this augmented environment."
Explanation: The paper describes enhancing a benchmark environment (Safety Gym) to generate experimental data, rather than using a pre-existing, publicly available dataset.
Dataset Splits: No
Evidence: "We apply an on-policy framework in our experiments. During each epoch the agent interacts B times with the environment and then performs a policy update based on the experience collected from the current epoch."
Explanation: The paper uses an on-policy reinforcement learning framework in a simulated environment, which generates data dynamically through interaction rather than relying on predefined train/test/validation dataset splits.
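The on-policy scheme quoted above (collect B interactions per epoch, then update from that epoch's experience only) can be sketched as follows. Everything here is illustrative: `toy_env` and the constant policy are hypothetical stand-ins for the augmented Safety Gym environment and the SCPO policy, and the "update" is a placeholder statistic, not the paper's trust-region step.

```python
def run_epoch(env_step, policy, B):
    """Collect exactly B on-policy transitions, then perform one update.

    env_step and policy are hypothetical callables standing in for the
    augmented Safety Gym environment and the SCPO policy; the update
    below is a trivial placeholder, not the paper's SCPO update.
    """
    batch = []
    state = 0.0
    for _ in range(B):
        action = policy(state)
        state, reward, cost = env_step(state, action)
        batch.append((state, action, reward, cost))
    # The update uses only this epoch's experience (on-policy):
    # here we just summarize it with the mean reward.
    mean_reward = sum(r for _, _, r, _ in batch) / B
    return batch, mean_reward

# Toy stand-ins for illustration only.
def toy_env(state, action):
    next_state = state + action
    # Reward penalizes distance from 0; cost flags a "constraint violation".
    return next_state, -abs(next_state), float(abs(next_state) > 1.0)

batch, mean_r = run_epoch(toy_env, lambda s: 0.1, B=5)
```

After each such epoch the collected batch is discarded, matching the paper's statement that updates use only the current epoch's experience.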
Hardware Specification: Yes
Evidence: "Each model is trained on a server with a 48-core Intel(R) Xeon(R) Silver 4214 CPU @ 2.2 GHz, an Nvidia RTX A4000 GPU with 16 GB memory, and Ubuntu 20.04."
Software Dependencies: No
Evidence: "MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012."
Explanation: The paper cites MuJoCo but does not state a version number, and it does not list version numbers for the other key software libraries or frameworks used in the implementation (e.g., PyTorch or TensorFlow).
Experiment Setup: Yes
Evidence: "The hyper-parameters used in our experiments are listed in Table 4 as default. ... Our experiments use separate multi-layer perceptrons with tanh activations for the policy network, value network, and cost network. Each network consists of two hidden layers of size (64, 64). All of the networks are trained using the Adam optimizer with a learning rate of 0.01. ... For all experiments, we use a discount factor of γ = 0.99, an advantage discount factor λ = 0.95, and a KL-divergence step size of δKL = 0.02. ... For experiments which consider cost constraints, we adopt a target cost δc = 0.0 to pursue a zero-violation policy."
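The pairing of a discount factor γ = 0.99 with an "advantage discount factor" λ = 0.95 is the standard Generalized Advantage Estimation (GAE) setup; assuming that is what the paper means (the function below is our illustrative sketch, not code from the paper's repository), the two factors combine as follows.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with the paper's reported
    factors gamma = 0.99 and lambda = 0.95.

    rewards and values are per-step lists of equal length; the value
    after the final step is taken to be 0 (episode boundary). This is
    an illustrative sketch of how the two discount factors interact,
    not the paper's implementation.
    """
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        # One-step TD residual, discounted by gamma.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of residuals, decayed by gamma * lam.
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

adv = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

With λ = 1 this reduces to discounted Monte Carlo advantages; λ = 0.95 trades a little bias for lower variance, which is the usual reason for the paper's choice.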