Extreme Value Policy Optimization for Safe Reinforcement Learning

Authors: Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, Xinbing Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines. Experimental results across multiple environments validate that EVO significantly reduces constraint violations while maintaining strong policy performance.
Researcher Affiliation Academia Shiqing Gao 1 Yihang Zhou 1 Shuai Shao 1 Haoyu Luo 1 Yiheng Bing 1 Jiaxin Ding 1 Luoyi Fu 1 Xinbing Wang 1 1Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Jiaxin Ding <EMAIL>.
Pseudocode Yes Algorithm 1 EVO: Extreme Value Policy Optimization
Open Source Code Yes We provide the code for EVO in https://github.com/ShiqingGao/EVO.
Open Datasets Yes Environment. The environments in our experiments consist of Safety Gymnasium and MuJoCo (Ji et al., 2024). The Safety Gymnasium tasks simulate a simplified version of autonomous driving, where the robot is required to reach a goal while avoiding obstacles. The Safety MuJoCo tasks focus on robot motion control, with agents being rewarded for maintaining a straight path while adhering to a speed limit to ensure safety and stability. Details are provided in Appendix C.3. Safety Gym is the standard API for safe reinforcement learning developed by OpenAI.
Dataset Splits No The environments in our experiments consist of Safety Gymnasium and MuJoCo (Ji et al., 2024). All experiments followed uniform conditions to ensure fairness and reproducibility, with a total of 10^7 training time steps and a maximum trajectory length of 1000 steps. The paper describes simulated environments and training steps, but not specific train/test/validation splits for a static dataset.
Hardware Specification Yes All experiments are implemented in PyTorch 2.0.0 and CUDA 11.3 and performed on Ubuntu 20.04.2 LTS with a single GPU (GeForce RTX 3090).
Software Dependencies Yes All experiments are implemented in PyTorch 2.0.0 and CUDA 11.3 and performed on Ubuntu 20.04.2 LTS with a single GPU (GeForce RTX 3090).
Experiment Setup Yes Experiment Setting. All experiments followed uniform conditions to ensure fairness and reproducibility, with a total of 10^7 training time steps and a maximum trajectory length of 1000 steps. To reduce randomness, 6 random seeds were used for each method, and the results are presented as mean and variance. The parameter settings are in Appendix C.4. The hyperparameters are summarized in Table 3.
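The experiment setup above reports results as mean and variance across 6 random seeds. A minimal sketch of that aggregation step is shown below; the function name, the example return values, and the choice of population variance are illustrative assumptions, since the paper does not specify the exact aggregation code.

```python
# Hypothetical sketch (not from the EVO paper): aggregating per-seed
# final returns into the mean and variance reported over 6 seeds.
import statistics


def aggregate_seed_results(per_seed_returns):
    """Return (mean, variance) of final returns across seed runs."""
    mean = statistics.fmean(per_seed_returns)
    # Population variance over the seed runs; the paper does not state
    # whether sample or population variance is used.
    var = statistics.pvariance(per_seed_returns, mu=mean)
    return mean, var


# Example with 6 made-up final returns, one per random seed.
returns = [812.0, 798.5, 825.3, 804.1, 819.7, 801.2]
mean, var = aggregate_seed_results(returns)
```

In practice the same aggregation would be applied per evaluation checkpoint to produce the mean curve and shaded variance band typical of safe-RL training plots.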