State-wise Constrained Policy Optimization

Authors: Weiye Zhao, Rui Chen, Yifan Sun, Feihan Li, Tianhao Wei, Changliu Liu

TMLR 2024

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental
Evidence: "We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks."
Researcher Affiliation: Academia
Evidence: Weiye Zhao, Rui Chen, Yifan Sun, Feihan Li, Tianhao Wei, and Changliu Liu are all affiliated with the Robotics Institute, Carnegie Mellon University.
Pseudocode: Yes
Evidence: "Algorithm 1: State-wise Constrained Policy Optimization"
Open Source Code: Yes
Evidence: "Our code is available on Github." (footnote 1: https://github.com/intelligent-control-lab/State_Wise_Constrained_Policy_Optimization)
Open Datasets: No
Evidence: "New Safety Gym. To showcase the effectiveness of our state-wise constrained policy optimization approach, we enhance the widely recognized safe reinforcement learning benchmark environment, Safety Gym (Ray et al., 2019), by incorporating additional robots and constraints. Subsequently, we perform a series of experiments on this augmented environment."
Explanation: The paper describes enhancing a benchmark environment (Safety Gym) to generate experimental data, rather than using a pre-existing, publicly available dataset.
Dataset Splits: No
Evidence: "We apply an on-policy framework in our experiments. During each epoch the agent interacts B times with the environment and then performs a policy update based on the experience collected from the current epoch."
Explanation: The paper uses an on-policy reinforcement learning framework in a simulated environment, which generates data dynamically through interaction rather than relying on predefined train/test/validation dataset splits.
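The on-policy scheme quoted above (collect B interactions per epoch, then update from that epoch's experience only) can be sketched as follows. Everything here is illustrative: `toy_env` and the constant policy are hypothetical stand-ins for the augmented Safety Gym environment and the SCPO policy, and the "update" is a placeholder statistic, not the paper's trust-region step.

```python
def run_epoch(env_step, policy, B):
    """Collect exactly B on-policy transitions, then perform one update.

    env_step and policy are hypothetical callables standing in for the
    augmented Safety Gym environment and the SCPO policy; the update
    below is a trivial placeholder, not the paper's SCPO update.
    """
    batch = []
    state = 0.0
    for _ in range(B):
        action = policy(state)
        state, reward, cost = env_step(state, action)
        batch.append((state, action, reward, cost))
    # The update uses only this epoch's experience (on-policy):
    # here we just summarize it with the mean reward.
    mean_reward = sum(r for _, _, r, _ in batch) / B
    return batch, mean_reward

# Toy stand-ins for illustration only.
def toy_env(state, action):
    next_state = state + action
    # Reward penalizes distance from 0; cost flags a "constraint violation".
    return next_state, -abs(next_state), float(abs(next_state) > 1.0)

batch, mean_r = run_epoch(toy_env, lambda s: 0.1, B=5)
```

After each such epoch the collected batch is discarded, matching the paper's statement that updates use only the current epoch's experience.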
Hardware Specification: Yes
Evidence: "Each model is trained on a server with a 48-core Intel(R) Xeon(R) Silver 4214 CPU @ 2.2 GHz, an Nvidia RTX A4000 GPU with 16 GB memory, and Ubuntu 20.04."
Software Dependencies: No
Evidence: "MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012."
Explanation: The paper cites MuJoCo but does not state a version number, and it does not list version numbers for the other key software libraries or frameworks used in the implementation (e.g., PyTorch or TensorFlow).
Experiment Setup: Yes
Evidence: "The hyper-parameters used in our experiments are listed in Table 4 as default. ... Our experiments use separate multi-layer perceptrons with tanh activations for the policy network, value network, and cost network. Each network consists of two hidden layers of size (64, 64). All of the networks are trained using the Adam optimizer with a learning rate of 0.01. ... For all experiments, we use a discount factor of γ = 0.99, an advantage discount factor λ = 0.95, and a KL-divergence step size of δKL = 0.02. ... For experiments which consider cost constraints, we adopt a target cost δc = 0.0 to pursue a zero-violation policy."
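The pairing of a discount factor γ = 0.99 with an "advantage discount factor" λ = 0.95 is the standard Generalized Advantage Estimation (GAE) setup; assuming that is what the paper means (the function below is our illustrative sketch, not code from the paper's repository), the two factors combine as follows.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation with the paper's reported
    factors gamma = 0.99 and lambda = 0.95.

    rewards and values are per-step lists of equal length; the value
    after the final step is taken to be 0 (episode boundary). This is
    an illustrative sketch of how the two discount factors interact,
    not the paper's implementation.
    """
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        # One-step TD residual, discounted by gamma.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of residuals, decayed by gamma * lam.
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

adv = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
```

With λ = 1 this reduces to discounted Monte Carlo advantages; λ = 0.95 trades a little bias for lower variance, which is the usual reason for the paper's choice.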