Off-Policy Evaluation under Nonignorable Missing Data

Authors: Han Wang, Yang Xu, Wenbin Lu, Rui Song

ICML 2025

Reproducibility assessment (Variable / Result / LLM response):

Research Type: Experimental
"Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields a more reliable value inference under missing data. The effectiveness of the proposed estimator is empirically demonstrated through a simulation study and a real application to MIMIC-III data."

Researcher Affiliation: Academia
"Department of Statistics, North Carolina State University, USA. Correspondence to: Rui Song <EMAIL>."

Pseudocode: Yes
"The complete algorithm is outlined in Algorithm 1 of Appendix E.1."

Open Source Code: Yes
"All supplementary code is available at our GitHub repository."

Open Datasets: Yes
"For example, in movie recommendation systems like MovieLens (Harper & Konstan, 2015), platforms aim to determine the best strategy for recommending personalized movie genres using historical user rating data. ... We now demonstrate the accuracy and stability of our value estimates by comparing them with existing baselines using a real-world sepsis dataset from the Medical Information Mart for Intensive Care (MIMIC-III v1.4) database (Johnson et al., 2016)."

Dataset Splits: Yes
"In our implementation, the dataset is split into two parts, with the first part used for learning the optimal policy and the second part for policy evaluation."

Hardware Specification: No
The paper does not explicitly mention any specific hardware (e.g., CPU or GPU models, memory, or cloud instances) used to run the experiments. It describes simulation studies and real-world data applications without providing hardware details.

Software Dependencies: No
The paper mentions several software components, such as random forests, Deep Q-Network, Dueling Double Deep Q-Network, Batch-Constrained Deep Q-Learning (BCQ), cubic B-spline bases, and the limited-memory BFGS algorithm, and states that a "pure-Python re-implementation" was used for data processing. However, no version numbers are provided for any of these components or libraries.

Experiment Setup: Yes
"Throughout the simulation studies, we set the discount factor γ to 0.9. ... In our implementation, we first scale the state variables onto [0, 1] and then construct 6 cubic B-spline bases for each dimension. ... For a fair comparison, here we fix L = 36 throughout the experiments despite the sample sizes. ... We add a small ridge penalty with weight 10^-5 to improve the stability. ... The nonparametric part is approximated using a Gaussian kernel with bandwidth h_l = c σ_l n^(-1/3). ... We pick c = 7.5 in the bandwidth formula. ... To avoid extremely large inverse weights, we bound the missing propensity below at 0.01. ... For the other three types of Q-learning algorithms, we run 2×10^5 iterations with minibatch size 256 and learning rate 1×10^-3."
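The experiment-setup quote above combines several concrete ingredients: min-max scaling of states onto [0, 1], a 6-function cubic B-spline basis per state dimension, a rule-of-thumb Gaussian-kernel bandwidth h_l = c σ_l n^(-1/3) with c = 7.5, and clipping the estimated missing propensity below at 0.01. A minimal sketch of these steps follows; the function name, knot placement, and random data are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from scipy.interpolate import BSpline


def cubic_bspline_design(x, n_basis=6):
    """Design matrix of n_basis cubic B-spline bases on [0, 1] (clamped knots)."""
    k = 3  # cubic degree
    n_interior = n_basis - k - 1  # 2 interior knots for 6 bases
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.r_[np.zeros(k + 1), interior, np.ones(k + 1)]
    # Identity coefficients make BSpline evaluate every basis function at once.
    return BSpline(knots, np.eye(n_basis), k)(x)  # shape (len(x), n_basis)


rng = np.random.default_rng(0)
states = rng.normal(size=(100, 2))  # toy batch: 100 transitions, 2-dim state

# Scale each state dimension onto [0, 1], then expand in the spline basis.
scaled = (states - states.min(axis=0)) / (states.max(axis=0) - states.min(axis=0))
Phi = np.hstack([cubic_bspline_design(scaled[:, j]) for j in range(scaled.shape[1])])

# Rule-of-thumb bandwidth h_l = c * sigma_l * n**(-1/3) with c = 7.5.
n, c = scaled.shape[0], 7.5
bandwidths = c * scaled.std(axis=0) * n ** (-1 / 3)

# Bound the (here simulated) missing propensity below at 0.01 before inverting.
propensity = np.clip(rng.uniform(0.0, 1.0, n), 0.01, None)
inverse_weights = 1.0 / propensity
```

With 6 bases per dimension and a 2-dimensional state, the stacked design has 12 columns; the paper's fixed L = 36 corresponds to the same construction over more dimensions or a tensor-product basis.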