Conservative Evaluation of Offline Policy Learning

Authors: Hager Radi Abdelwahed, Josiah P. Hanna, Matthew E. Taylor

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate CEOPL on a range of tasks as well as real-world medical data. This section discusses the experiments and results of CEOPL on simulated setups in discrete and continuous control tasks."
Researcher Affiliation | Academia | Hager Radi Abdelwahed (EMAIL), Department of Computing Science, University of Alberta, Edmonton, Canada; Josiah P. Hanna (EMAIL), Computer Sciences Department, University of Wisconsin–Madison; Matthew E. Taylor (EMAIL), Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Edmonton, Canada
Pseudocode | Yes | "CEOPL is further explained in Algorithm 1. The output will be a policy trained offline πθ and a confidence lower-bound estimate of its return v̂δ(πθ). We refer to the algorithm used as bootstrap confidence intervals (BCI), and the pseudo-code is detailed in Algorithm 2 in Appendix A.3."
Open Source Code | No | The paper does not provide access to source code or explicitly state that the code will be released.
Open Datasets | Yes | "We use the well-known MIMIC-III data (Johnson et al., 2016) for sepsis treatment. To build a decision-making policy for the treatment of septic patients, we use data from the Medical Information Mart for Intensive Care (MIMIC-III) dataset (v1.4) (Johnson et al., 2016)."
Dataset Splits | Yes | "Given that the available data is fixed, we split the dataset once at the beginning into a train set, used to train the offline policy, and a test set, used to evaluate policy performance during training. With stratified sampling (Killian et al., 2020), we use a 70/30 train/test split, which maintains the same proportions of each terminal outcome (survival or mortality)."
Hardware Specification | No | The paper does not describe the hardware used to run its experiments; it only acknowledges general support from research grants and organizations, without hardware details.
Software Dependencies | No | The paper mentions using OpenAI Gym environments for the simulated tasks, but it does not provide version numbers for any software dependencies or libraries used in the implementation.
Experiment Setup | Yes | "For bootstrapping, we use δ = 0.05 to get a 95% confidence lower bound using B = 2000 bootstrap estimates... For each iteration, 300 trajectories are sampled, where 20 trajectories go into the training buffer and 280 trajectories are used for evaluation... k is set to 29 for BCQ and Double DQN, while BC does 26 policy updates in one iteration... For each iteration, 500 trajectories are sampled, where 100 trajectories go into the training buffer and 400 trajectories are used for evaluation... we train the policy for 200k epochs, where we evaluate the policy every 10k epochs."
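The stratified 70/30 split quoted under "Dataset Splits" can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the synthetic patient IDs, and the outcome labels are all hypothetical, and only the 70/30 ratio and the idea of preserving per-outcome proportions come from the source.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_frac=0.3, seed=0):
    """Split items into train/test while preserving the proportion of
    each label (e.g., survival vs. mortality) in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical cohort: 100 patients, 80 survival / 20 mortality outcomes.
patients = list(range(100))
outcomes = ["survival"] * 80 + ["mortality"] * 20
train, test = stratified_split(patients, outcomes)
```

Because each outcome group is split separately, the 80/20 survival-to-mortality ratio is maintained in both the 70-patient train set and the 30-patient test set.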
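The bootstrap lower bound referenced under "Pseudocode" and "Experiment Setup" (δ = 0.05, B = 2000) can be approximated with a plain percentile bootstrap over trajectory returns. This is a generic sketch, not the paper's Algorithm 2; the function name and the synthetic returns are illustrative.

```python
import random
import statistics

def bootstrap_lower_bound(returns, delta=0.05, num_bootstrap=2000, seed=0):
    """Estimate a (1 - delta) confidence lower bound on the mean return:
    resample the returns with replacement num_bootstrap times and take
    the delta-quantile of the bootstrap means."""
    rng = random.Random(seed)
    n = len(returns)
    means = []
    for _ in range(num_bootstrap):
        sample = [returns[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    idx = max(0, int(delta * num_bootstrap) - 1)
    return means[idx]

# Synthetic stand-in for the 280 evaluation-trajectory returns.
rng = random.Random(1)
returns = [rng.gauss(1.0, 0.5) for _ in range(280)]
lb = bootstrap_lower_bound(returns)        # 95% lower bound on mean return
mean = statistics.mean(returns)
```

The percentile variant used here is the simplest choice; the lower bound `lb` sits below the sample mean, which is what makes the evaluation conservative.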