Doubly Optimal Policy Evaluation for Reinforcement Learning
Authors: Shuze Liu, Claire Chen, Shangtong Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance. |
| Researcher Affiliation | Academia | Shuze Daniel Liu, Department of Computer Science, University of Virginia, EMAIL; Claire Chen, School of Arts and Science, University of Virginia, EMAIL; Shangtong Zhang, Department of Computer Science, University of Virginia, EMAIL |
| Pseudocode | Yes | Algorithm 1: Doubly Optimal (DOpt) Policy Evaluation |
| Open Source Code | No | The paper discusses the source code of a third-party tool or platform that the authors used ('We use the default PPO implementation in Huang et al. (2022)'), but does not provide their own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | MuJoCo: We also conduct experiments in MuJoCo robot simulation tasks (Todorov et al., 2012). |
| Dataset Splits | No | To learn functions qπ,t and uπ,t, we split the offline data into a training set and a test set. (No specific percentages or sample counts for the splits are provided.) |
| Hardware Specification | No | The paper does not mention any specific hardware specifications like CPU or GPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions several algorithms and tools (Fitted Q-Evaluation, PPO, Adam optimizer, Gymnasium) and refers to an implementation from another paper (Huang et al., 2022), but it does not provide specific version numbers for any libraries, programming languages, or other ancillary software components used for their experiments. |
| Experiment Setup | Yes | We choose a one-hidden-layer neural network and test the neural network size with [64, 128, 256] and choose 64 as the final size. We test the learning rate for Adam optimizer with [1e-5, 1e-4, 1e-3, 1e-2] and choose to use the default learning rate 1e-3 as learning rate for Adam optimizer (Kingma and Ba, 2015). |
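The hyperparameter sweep quoted above (a one-hidden-layer network with hidden sizes [64, 128, 256], 64 chosen; Adam learning rates [1e-5, 1e-4, 1e-3, 1e-2], with the default 1e-3 chosen) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the network definition and all names are assumptions.

```python
# Hedged sketch of the reported hyperparameter setup: a one-hidden-layer
# MLP and a grid over hidden sizes and Adam learning rates. The paper
# reports the sweep values and the chosen settings; everything else
# (initialization, activation, function names) is illustrative.
import itertools

import numpy as np

HIDDEN_SIZES = [64, 128, 256]               # swept; 64 chosen
LEARNING_RATES = [1e-5, 1e-4, 1e-3, 1e-2]   # swept; Adam default 1e-3 chosen


def make_one_hidden_layer_net(in_dim, hidden_dim, out_dim, rng):
    """Initialize a one-hidden-layer MLP (illustrative scaled-normal init)."""
    w1 = rng.standard_normal((in_dim, hidden_dim)) / np.sqrt(in_dim)
    w2 = rng.standard_normal((hidden_dim, out_dim)) / np.sqrt(hidden_dim)
    return w1, w2


def forward(params, x):
    """Forward pass: ReLU hidden layer, linear output."""
    w1, w2 = params
    h = np.maximum(x @ w1, 0.0)
    return h @ w2


# The full sweep covers every (hidden size, learning rate) pair.
grid = list(itertools.product(HIDDEN_SIZES, LEARNING_RATES))

# Chosen configuration per the quoted setup.
chosen_hidden, chosen_lr = 64, 1e-3
rng = np.random.default_rng(0)
params = make_one_hidden_layer_net(in_dim=8, hidden_dim=chosen_hidden,
                                   out_dim=1, rng=rng)
out = forward(params, rng.standard_normal((5, 8)))
```

The grid here has 3 × 4 = 12 configurations; the report only states which values were tested and which were selected, not how the sweep was evaluated.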