Extreme Value Policy Optimization for Safe Reinforcement Learning
Authors: Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, Xinbing Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple environments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines. |
| Researcher Affiliation | Academia | Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, Xinbing Wang (Shanghai Jiao Tong University, Shanghai, China). Correspondence to: Jiaxin Ding <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 EVO: Extreme Value Policy Optimization |
| Open Source Code | Yes | We provide the code for EVO at https://github.com/ShiqingGao/EVO. |
| Open Datasets | Yes | Environment. The environments in our experiments consist of Safety Gymnasium and MuJoCo (Ji et al., 2024). The Safety Gymnasium tasks simulate a simplified version of autonomous driving, where the robot is required to reach a goal while avoiding obstacles. The Safety MuJoCo tasks focus on robot motion control, with agents being rewarded for maintaining a straight path while adhering to a speed limit to ensure safety and stability. Details are provided in Appendix C.3. Safety Gym is the standard API for safe reinforcement learning developed by OpenAI. |
| Dataset Splits | No | The environments in our experiments consist of Safety Gymnasium and MuJoCo (Ji et al., 2024). All experiments followed uniform conditions to ensure fairness and reproducibility, with a total of 10^7 training time steps and a maximum trajectory length of 1000 steps. The paper describes simulated environments and training steps, but not specific train/test/validation splits for a static dataset. |
| Hardware Specification | Yes | All experiments are implemented in PyTorch 2.0.0 and CUDA 11.3 and performed on Ubuntu 20.04.2 LTS with a single GPU (GeForce RTX 3090). |
| Software Dependencies | Yes | All experiments are implemented in PyTorch 2.0.0 and CUDA 11.3 and performed on Ubuntu 20.04.2 LTS with a single GPU (GeForce RTX 3090). |
| Experiment Setup | Yes | Experiment Setting. All experiments followed uniform conditions to ensure fairness and reproducibility, with a total of 10^7 training time steps and a maximum trajectory length of 1000 steps. To reduce randomness, 6 random seeds were used for each method, and the results are presented as mean and variance. The parameter settings are in Appendix C.4. The hyperparameters are summarized in Table 3. |