Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning
Authors: Claire Chen, Shuze Liu, Shangtong Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method is the only existing method to achieve both substantial variance reduction and satisfaction of the safety constraint. Furthermore, we show that our method is superior to previous methods in both variance reduction and execution safety. In this section, we present empirical results comparing our method against three baselines: (1) the on-policy Monte Carlo estimator, (2) the robust on-policy sampling estimator (ROS, Zhong et al. (2022)), and (3) the offline data informed estimator (ODI, Liu and Zhang (2024)). |
| Researcher Affiliation | Academia | Claire Chen School of Arts and Science University of Virginia EMAIL Shuze Daniel Liu Department of Computer Science University of Virginia EMAIL Shangtong Zhang Department of Computer Science University of Virginia EMAIL |
| Pseudocode | Yes | Algorithm 1: Safety-Constrained Optimal Policy Evaluation (SCOPE) |
| Open Source Code | No | The paper mentions using the default PPO implementation in Huang et al. (2022) and OpenAI Gymnasium, but does not provide any link or statement for their own implementation code. |
| Open Datasets | Yes | Gridworld: We first conduct experiments in Gridworld with n³ states. Each Gridworld is an n × n grid with the time horizon also being n. MuJoCo: Next, we conduct experiments in MuJoCo robot simulation tasks (Todorov et al., 2012). |
| Dataset Splits | No | The offline dataset of each environment contains a total of 1,000 episodes generated by 30 policies with various performances. The ground truth policy performance is estimated by the on-policy Monte Carlo method, running each target policy for 10^6 episodes. Each policy has 30 independent runs, resulting in a total of 30 × 30 = 900 runs. Thus, each curve in Figure 1 and Figure 2 and each number in Table 1, Table 3, and Table 4 is averaged over 900 different runs across a wide range of policies, demonstrating strong statistical significance. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions the "proximal policy optimization (PPO) algorithm (Schulman et al., 2017) using the default PPO implementation in Huang et al. (2022)", "OpenAI Gymnasium (Brockman et al., 2016)", and the "Adam optimizer (Kingma and Ba, 2015)", but does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | All hyperparameters are tuned offline based on the Fitted Q-learning loss. We use a one-hidden-layer neural network and test hidden-layer sizes in [64, 128, 256], choosing 64 as the final size. We also test learning rates for the Adam optimizer in [1e-5, 1e-4, 1e-3, 1e-2] and choose the default learning rate of 1e-3 (Kingma and Ba, 2015). For the benchmark algorithms, we use their reported hyperparameters (Zhong et al., 2022; Liu and Zhang, 2024). |
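The first baseline above, the on-policy Monte Carlo estimator, is also how the report says the ground truth is computed (10^6 episodes per target policy). A minimal sketch of that estimator follows; the environment interface (`env_reset`, `env_step`) and the toy rollout parameters are illustrative assumptions, not the paper's implementation.

```python
def monte_carlo_estimate(env_step, env_reset, policy, num_episodes, horizon):
    """On-policy Monte Carlo policy evaluation (illustrative sketch).

    Runs the target policy for `num_episodes` episodes and averages the
    undiscounted episodic returns. `env_reset() -> state` and
    `env_step(state, action) -> (state, reward, done)` are assumed
    placeholder interfaces, not the paper's actual environment API.
    """
    returns = []
    for _ in range(num_episodes):
        state = env_reset()
        total = 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done = env_step(state, action)
            total += reward
            if done:
                break
        returns.append(total)
    # Sample mean of episodic returns; its variance is what SCOPE, ROS,
    # and ODI aim to reduce relative to this plain estimator.
    return sum(returns) / len(returns)
```

For example, on a deterministic chain environment that pays reward 1 per step and terminates after three steps, the estimate is exactly 3.0 regardless of the number of episodes.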
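The experiment setup row describes a one-hidden-layer network (final hidden size 64) trained with the Adam optimizer at the default learning rate of 1e-3. The sketch below shows what such a setup could look like; the NumPy implementation, He-style initialization, and MSE training objective are assumptions for illustration, not the paper's code.

```python
import numpy as np

def init_net(in_dim, hidden=64, out_dim=1, seed=0):
    # One-hidden-layer network matching the reported final size (64);
    # the initialization scheme is an assumption for illustration.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def init_adam(params):
    # First/second-moment accumulators and step counter for Adam.
    zeros = lambda: {k: np.zeros_like(v) for k, v in params.items()}
    return {"t": 0, "m": zeros(), "v": zeros()}

def forward(params, x):
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"], h

def mse_grads(params, x, y):
    # Backprop of mean-squared-error loss through the one-hidden-layer net.
    pred, h = forward(params, x)
    n = x.shape[0]
    d_out = 2.0 * (pred - y) / n
    grads = {"W2": h.T @ d_out, "b2": d_out.sum(axis=0)}
    d_h = (d_out @ params["W2"].T) * (h > 0)
    grads["W1"] = x.T @ d_h
    grads["b1"] = d_h.sum(axis=0)
    return grads, float(np.mean((pred - y) ** 2))

def adam_step(params, grads, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update (Kingma and Ba, 2015) with the default 1e-3
    # learning rate the authors report choosing.
    state["t"] += 1
    for k in params:
        state["m"][k] = b1 * state["m"][k] + (1 - b1) * grads[k]
        state["v"][k] = b2 * state["v"][k] + (1 - b2) * grads[k] ** 2
        m_hat = state["m"][k] / (1 - b1 ** state["t"])
        v_hat = state["v"][k] / (1 - b2 ** state["t"])
        params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

Under this setup, a few hundred Adam steps on a small regression batch steadily reduce the training loss, which is the kind of offline loss signal the report says was used to pick the hidden size and learning rate.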