Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning
Authors: Claire Chen, Shuze Liu, Shangtong Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method is the only existing method to achieve both substantial variance reduction and satisfaction of the safety constraint. Furthermore, we show that our method is superior to previous methods in both variance reduction and execution safety. In this section, we present empirical results comparing our method against three baselines: (1) the on-policy Monte Carlo estimator, (2) the robust on-policy sampling estimator (ROS, Zhong et al. (2022)), and (3) the offline data informed estimator (ODI, Liu and Zhang (2024)). |
| Researcher Affiliation | Academia | Claire Chen School of Arts and Science University of Virginia EMAIL Shuze Daniel Liu Department of Computer Science University of Virginia EMAIL Shangtong Zhang Department of Computer Science University of Virginia EMAIL |
| Pseudocode | Yes | Algorithm 1: Safety-Constrained Optimal Policy Evaluation (SCOPE) |
| Open Source Code | No | The paper mentions using the default PPO implementation in Huang et al. (2022) and OpenAI Gymnasium, but does not provide any link or statement for their own implementation code. |
| Open Datasets | Yes | Gridworld: We first conduct experiments in Gridworld with n³ states. Each Gridworld is an n × n grid with the time horizon also being n. MuJoCo: Next, we conduct experiments in MuJoCo robot simulation tasks (Todorov et al., 2012). |
| Dataset Splits | No | The offline dataset of each environment contains a total of 1,000 episodes generated by 30 policies with various performances. The ground truth policy performance is estimated by the on-policy Monte Carlo method, running each target policy for 10^6 episodes. Each policy has 30 independent runs, resulting in a total of 30 × 30 = 900 runs. Thus, each curve in Figure 1 and Figure 2 and each number in Table 1, Table 3, and Table 4 is averaged over 900 different runs across a wide range of policies, demonstrating strong statistical significance. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions the "proximal policy optimization (PPO) algorithm (Schulman et al., 2017) using the default PPO implementation in Huang et al. (2022)", "OpenAI Gymnasium (Brockman et al., 2016)", and the "Adam optimizer (Kingma and Ba, 2015)", but does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | All hyperparameters are tuned offline based on the Fitted Q-learning loss. We use a one-hidden-layer neural network and test hidden-layer sizes in [64, 128, 256], choosing 64 as the final size. We also test learning rates for the Adam optimizer in [1e-5, 1e-4, 1e-3, 1e-2] and choose the default learning rate of 1e-3 (Kingma and Ba, 2015). For the benchmark algorithms, we use their reported hyperparameters (Zhong et al., 2022; Liu and Zhang, 2024). |
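The first baseline above, the on-policy Monte Carlo estimator, is also how the report says the ground truth is computed (10^6 episodes per target policy). A minimal sketch of that estimator follows; the environment interface (`env_reset`, `env_step`) and the toy rollout parameters are illustrative assumptions, not the paper's implementation.

```python
def monte_carlo_estimate(env_step, env_reset, policy, num_episodes, horizon):
    """On-policy Monte Carlo policy evaluation (illustrative sketch).

    Runs the target policy for `num_episodes` episodes and averages the
    undiscounted episodic returns. `env_reset() -> state` and
    `env_step(state, action) -> (state, reward, done)` are assumed
    placeholder interfaces, not the paper's actual environment API.
    """
    returns = []
    for _ in range(num_episodes):
        state = env_reset()
        total = 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done = env_step(state, action)
            total += reward
            if done:
                break
        returns.append(total)
    # Sample mean of episodic returns; its variance is what SCOPE, ROS,
    # and ODI aim to reduce relative to this plain estimator.
    return sum(returns) / len(returns)
```

For example, on a deterministic chain environment that pays reward 1 per step and terminates after three steps, the estimate is exactly 3.0 regardless of the number of episodes.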
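The experiment setup row describes a one-hidden-layer network (final hidden size 64) trained with the Adam optimizer at the default learning rate of 1e-3. The sketch below shows what such a setup could look like; the NumPy implementation, He-style initialization, and MSE training objective are assumptions for illustration, not the paper's code.

```python
import numpy as np

def init_net(in_dim, hidden=64, out_dim=1, seed=0):
    # One-hidden-layer network matching the reported final size (64);
    # the initialization scheme is an assumption for illustration.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def init_adam(params):
    # First/second-moment accumulators and step counter for Adam.
    zeros = lambda: {k: np.zeros_like(v) for k, v in params.items()}
    return {"t": 0, "m": zeros(), "v": zeros()}

def forward(params, x):
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"], h

def mse_grads(params, x, y):
    # Backprop of mean-squared-error loss through the one-hidden-layer net.
    pred, h = forward(params, x)
    n = x.shape[0]
    d_out = 2.0 * (pred - y) / n
    grads = {"W2": h.T @ d_out, "b2": d_out.sum(axis=0)}
    d_h = (d_out @ params["W2"].T) * (h > 0)
    grads["W1"] = x.T @ d_h
    grads["b1"] = d_h.sum(axis=0)
    return grads, float(np.mean((pred - y) ** 2))

def adam_step(params, grads, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update (Kingma and Ba, 2015) with the default 1e-3
    # learning rate the authors report choosing.
    state["t"] += 1
    for k in params:
        state["m"][k] = b1 * state["m"][k] + (1 - b1) * grads[k]
        state["v"][k] = b2 * state["v"][k] + (1 - b2) * grads[k] ** 2
        m_hat = state["m"][k] / (1 - b1 ** state["t"])
        v_hat = state["v"][k] / (1 - b2 ** state["t"])
        params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

Under this setup, a few hundred Adam steps on a small regression batch steadily reduce the training loss, which is the kind of offline loss signal the report says was used to pick the hidden size and learning rate.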