Extreme Value Policy Optimization for Safe Reinforcement Learning

Authors: Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, Xinbing Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines. Experimental results across multiple environments validate that EVO significantly reduces constraint violations while maintaining strong policy performance.
Researcher Affiliation Academia Shiqing Gao 1 Yihang Zhou 1 Shuai Shao 1 Haoyu Luo 1 Yiheng Bing 1 Jiaxin Ding 1 Luoyi Fu 1 Xinbing Wang 1 1Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Jiaxin Ding <EMAIL>.
Pseudocode Yes Algorithm 1 EVO: Extreme Value Policy Optimization
Open Source Code Yes We provide the code for EVO in https://github.com/ShiqingGao/EVO.
Open Datasets Yes Environment. The environments in our experiments consist of Safety Gymnasium and MuJoCo (Ji et al., 2024). The Safety Gymnasium tasks simulate a simplified version of autonomous driving, where the robot is required to reach a goal while avoiding obstacles. The Safety MuJoCo tasks focus on robot motion control, with agents being rewarded for maintaining a straight path while adhering to a speed limit to ensure safety and stability. Details are provided in Appendix C.3. Safety Gym is the standard API for safe reinforcement learning developed by OpenAI.
Dataset Splits No The environments in our experiments consist of Safety Gymnasium and MuJoCo (Ji et al., 2024). All experiments followed uniform conditions to ensure fairness and reproducibility, with a total of 10^7 training time steps and a maximum trajectory length of 1000 steps. The paper describes simulated environments and training steps, but not specific train/test/validation splits for a static dataset.
Hardware Specification Yes All experiments are implemented in PyTorch 2.0.0 and CUDA 11.3 and performed on Ubuntu 20.04.2 LTS with a single GPU (GeForce RTX 3090).
Software Dependencies Yes All experiments are implemented in PyTorch 2.0.0 and CUDA 11.3 and performed on Ubuntu 20.04.2 LTS with a single GPU (GeForce RTX 3090).
Experiment Setup Yes Experiment Setting. All experiments followed uniform conditions to ensure fairness and reproducibility, with a total of 10^7 training time steps and a maximum trajectory length of 1000 steps. To reduce randomness, 6 random seeds were used for each method, and the results are presented as mean and variance. The parameter settings are in Appendix C.4. The hyperparameters are summarized in Table 3.
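The experiment setup above reports results as mean and variance across 6 random seeds. A minimal sketch of that aggregation step is shown below; the function name, the example return values, and the choice of population variance are illustrative assumptions, since the paper does not specify the exact aggregation code.

```python
# Hypothetical sketch (not from the EVO paper): aggregating per-seed
# final returns into the mean and variance reported over 6 seeds.
import statistics


def aggregate_seed_results(per_seed_returns):
    """Return (mean, variance) of final returns across seed runs."""
    mean = statistics.fmean(per_seed_returns)
    # Population variance over the seed runs; the paper does not state
    # whether sample or population variance is used.
    var = statistics.pvariance(per_seed_returns, mu=mean)
    return mean, var


# Example with 6 made-up final returns, one per random seed.
returns = [812.0, 798.5, 825.3, 804.1, 819.7, 801.2]
mean, var = aggregate_seed_results(returns)
```

In practice the same aggregation would be applied per evaluation checkpoint to produce the mean curve and shaded variance band typical of safe-RL training plots.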