Doubly Optimal Policy Evaluation for Reinforcement Learning

Authors: Shuze Liu, Claire Chen, Shangtong Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response

- Research Type | Experimental | "Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance."
- Researcher Affiliation | Academia | Shuze Daniel Liu, Department of Computer Science, University of Virginia (EMAIL); Claire Chen, School of Arts and Science, University of Virginia (EMAIL); Shangtong Zhang, Department of Computer Science, University of Virginia (EMAIL)
- Pseudocode | Yes | Algorithm 1: Doubly Optimal (DOpt) Policy Evaluation
- Open Source Code | No | The paper points to the source code of a third-party tool the authors used ("We use the default PPO implementation in Huang et al. (2022)") but does not provide the authors' own implementation of the methodology described in the paper.
- Open Datasets | Yes | MuJoCo: "We also conduct experiments in MuJoCo robot simulation tasks (Todorov et al., 2012)."
- Dataset Splits | No | "To learn functions q_{π,t} and u_{π,t}, we split the offline data into a training set and a test set." No specific percentages or sample counts for the splits are provided.
- Hardware Specification | No | The paper does not mention the specific hardware (e.g., CPU or GPU models) used to run the experiments.
- Software Dependencies | No | The paper mentions several algorithms and tools (Fitted Q-Evaluation, PPO, the Adam optimizer, Gymnasium) and refers to an implementation from another paper (Huang et al., 2022), but it does not give version numbers for any libraries, programming languages, or other ancillary software used in the experiments.
- Experiment Setup | Yes | "We choose a one-hidden-layer neural network and test the neural network size with [64, 128, 256] and choose 64 as the final size. We test the learning rate for Adam optimizer with [1e-5, 1e-4, 1e-3, 1e-2] and choose to use the default learning rate 1e-3 as learning rate for Adam optimizer (Kingma and Ba, 2015)."
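The reported setup (a one-hidden-layer network of width 64, chosen from [64, 128, 256], trained with Adam at the default learning rate 1e-3, chosen from [1e-5, 1e-4, 1e-3, 1e-2]) can be sketched in NumPy. This is a minimal illustrative sketch, not the authors' implementation: only the hidden-size and learning-rate grids come from the quoted setup, while the toy regression task, initialization scale, and training length are assumptions made for the example.

```python
import numpy as np

# Hyperparameters taken from the paper's reported setup; everything else
# (the toy regression task, init scale, loop length) is illustrative.
HIDDEN_SIZES = [64, 128, 256]               # candidate widths; 64 was chosen
LEARNING_RATES = [1e-5, 1e-4, 1e-3, 1e-2]   # candidate Adam LRs; 1e-3 chosen

def init_net(in_dim, hidden, out_dim, rng):
    return {"W1": rng.normal(0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0, 0.1, (hidden, out_dim)), "b2": np.zeros(out_dim)}

def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])      # single hidden layer
    return h @ p["W2"] + p["b2"], h

def mse_and_grads(p, x, t):
    y, h = forward(p, x)
    dy = 2.0 * (y - t) / len(x)             # d(MSE)/dy
    dz = (dy @ p["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
    grads = {"W1": x.T @ dz, "b1": dz.sum(0),
             "W2": h.T @ dy, "b2": dy.sum(0)}
    return float(np.mean((y - t) ** 2)), grads

def train(hidden=64, lr=1e-3, steps=300, seed=0, b1=0.9, b2=0.999, eps=1e-8):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(256, 4))
    t = np.sin(x.sum(axis=1, keepdims=True))        # toy regression target
    p = init_net(4, hidden, 1, rng)
    m = {k: np.zeros_like(v) for k, v in p.items()}
    v = {k: np.zeros_like(w) for k, w in p.items()}
    losses = []
    for step in range(1, steps + 1):
        loss, g = mse_and_grads(p, x, t)
        losses.append(loss)
        for k in p:                                 # Adam update (bias-corrected)
            m[k] = b1 * m[k] + (1 - b1) * g[k]
            v[k] = b2 * v[k] + (1 - b2) * g[k] ** 2
            m_hat = m[k] / (1 - b1 ** step)
            v_hat = v[k] / (1 - b2 ** step)
            p[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return losses

losses = train()
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice one would loop `train` over `HIDDEN_SIZES` and `LEARNING_RATES` and pick the configuration with the best held-out error, which is the selection procedure the quoted setup implies.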