Conservative Evaluation of Offline Policy Learning

Authors: Hager Radi Abdelwahed, Josiah P. Hanna, Matthew E. Taylor

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate CEOPL on a range of tasks as well as real-world medical data. This section discusses the experiments and results of CEOPL on simulated setups in discrete and continuous control tasks."
Researcher Affiliation | Academia | Hager Radi Abdelwahed (EMAIL), Department of Computing Science, University of Alberta, Edmonton, Canada; Josiah P. Hanna (EMAIL), Computer Sciences Department, University of Wisconsin–Madison; Matthew E. Taylor (EMAIL), Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Edmonton, Canada
Pseudocode | Yes | "CEOPL is further explained in Algorithm 1. The output will be a policy trained offline πθ and a confidence lower-bound estimate of its return v̂δ(πθ). We refer to the algorithm used as bootstrap confidence intervals (BCI), and the pseudo-code is detailed in Algorithm 2 in Appendix A.3."
Open Source Code | No | The paper does not provide access to source code or explicitly state that the code will be released.
Open Datasets | Yes | "We use the well-known MIMIC-III data (Johnson et al., 2016) for sepsis treatment. To build a decision-making policy for the treatment of septic patients, we use data from the Medical Information Mart for Intensive Care (MIMIC-III) dataset (v1.4) (Johnson et al., 2016)."
Dataset Splits | Yes | "Given that the available data is fixed, we split the dataset once at the beginning into a train set, used to train the offline policy, and a test set, used to evaluate policy performance during training. With stratified sampling (Killian et al., 2020), we use a 70/30 train/test split, which maintains the same proportions of each terminal outcome (survival or mortality)."
Hardware Specification | No | The paper does not describe the hardware used to run its experiments; it only acknowledges general support from research grants and organizations, without hardware details.
Software Dependencies | No | The paper mentions using OpenAI Gym environments for the simulated tasks, but it does not provide version numbers for any software dependencies or libraries used in the implementation.
Experiment Setup | Yes | "For bootstrapping, we use δ = 0.05 to get a 95% confidence lower bound using B = 2000 bootstrap estimates... For each iteration, 300 trajectories are sampled, where 20 trajectories go into the training buffer and 280 trajectories are used for evaluation... k is set to 29 for BCQ and Double DQN, while BC does 26 policy updates in one iteration... For each iteration, 500 trajectories are sampled, where 100 trajectories go into the training buffer and 400 trajectories are used for evaluation... we train the policy for 200k epochs, where we evaluate the policy every 10k epochs."
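The stratified 70/30 split quoted under "Dataset Splits" can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the synthetic patient IDs, and the outcome labels are all hypothetical, and only the 70/30 ratio and the idea of preserving per-outcome proportions come from the source.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_frac=0.3, seed=0):
    """Split items into train/test while preserving the proportion of
    each label (e.g., survival vs. mortality) in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical cohort: 100 patients, 80 survival / 20 mortality outcomes.
patients = list(range(100))
outcomes = ["survival"] * 80 + ["mortality"] * 20
train, test = stratified_split(patients, outcomes)
```

Because each outcome group is split separately, the 80/20 survival-to-mortality ratio is maintained in both the 70-patient train set and the 30-patient test set.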
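The bootstrap lower bound referenced under "Pseudocode" and "Experiment Setup" (δ = 0.05, B = 2000) can be approximated with a plain percentile bootstrap over trajectory returns. This is a generic sketch, not the paper's Algorithm 2; the function name and the synthetic returns are illustrative.

```python
import random
import statistics

def bootstrap_lower_bound(returns, delta=0.05, num_bootstrap=2000, seed=0):
    """Estimate a (1 - delta) confidence lower bound on the mean return:
    resample the returns with replacement num_bootstrap times and take
    the delta-quantile of the bootstrap means."""
    rng = random.Random(seed)
    n = len(returns)
    means = []
    for _ in range(num_bootstrap):
        sample = [returns[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    idx = max(0, int(delta * num_bootstrap) - 1)
    return means[idx]

# Synthetic stand-in for the 280 evaluation-trajectory returns.
rng = random.Random(1)
returns = [rng.gauss(1.0, 0.5) for _ in range(280)]
lb = bootstrap_lower_bound(returns)        # 95% lower bound on mean return
mean = statistics.mean(returns)
```

The percentile variant used here is the simplest choice; the lower bound `lb` sits below the sample mean, which is what makes the evaluation conservative.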