Off-Policy Evaluation under Nonignorable Missing Data

Authors: Han Wang, Yang Xu, Wenbin Lu, Rui Song

ICML 2025

Reproducibility assessment (Variable / Result / LLM response):

Research Type: Experimental
"Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields a more reliable value inference under missing data. The effectiveness of the proposed estimator is empirically demonstrated through a simulation study and a real application to MIMIC-III data."

Researcher Affiliation: Academia
"Department of Statistics, North Carolina State University, USA. Correspondence to: Rui Song <EMAIL>."

Pseudocode: Yes
"The complete algorithm is outlined in Algorithm 1 of Appendix E.1."

Open Source Code: Yes
"All supplementary code is available at our GitHub repository."

Open Datasets: Yes
"For example, in movie recommendation systems like MovieLens (Harper & Konstan, 2015), platforms aim to determine the best strategy for recommending personalized movie genres using historical user rating data. ... We now demonstrate the accuracy and stability of our value estimates by comparing them with existing baselines using a real-world sepsis dataset from the Medical Information Mart for Intensive Care (MIMIC-III v1.4) database (Johnson et al., 2016)."

Dataset Splits: Yes
"In our implementation, the dataset is split into two parts, with the first part used for learning the optimal policy and the second part for policy evaluation."

Hardware Specification: No
The paper does not explicitly mention any specific hardware (e.g., CPU or GPU models, memory, or cloud instances) used to run the experiments. It describes simulation studies and real-world data applications without providing hardware details.

Software Dependencies: No
The paper mentions several software components, such as random forests, Deep Q-Network, Dueling Double Deep Q-Network, Batch-Constrained Deep Q-Learning (BCQ), cubic B-spline bases, and the limited-memory BFGS algorithm, and states that a "pure-Python re-implementation" was used for data processing. However, no version numbers are provided for any of these components or libraries.

Experiment Setup: Yes
"Throughout the simulation studies, we set the discount factor γ to 0.9. ... In our implementation, we first scale the state variables onto [0, 1] and then construct 6 cubic B-spline bases for each dimension. ... For a fair comparison, here we fix L = 36 throughout the experiments despite the sample sizes. ... We add a small ridge penalty with weight 10^-5 to improve the stability. ... The nonparametric part is approximated using a Gaussian kernel with bandwidth h_l = c σ_l n^(-1/3). ... We pick c = 7.5 in the bandwidth formula. ... To avoid extremely large inverse weights, we bound the missing propensity below at 0.01. ... For the other three types of Q-learning algorithms, we run 2×10^5 iterations with minibatch size 256 and learning rate 1×10^-3."
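The experiment-setup quote above combines several concrete ingredients: min-max scaling of states onto [0, 1], a 6-function cubic B-spline basis per state dimension, a rule-of-thumb Gaussian-kernel bandwidth h_l = c σ_l n^(-1/3) with c = 7.5, and clipping the estimated missing propensity below at 0.01. A minimal sketch of these steps follows; the function name, knot placement, and random data are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from scipy.interpolate import BSpline


def cubic_bspline_design(x, n_basis=6):
    """Design matrix of n_basis cubic B-spline bases on [0, 1] (clamped knots)."""
    k = 3  # cubic degree
    n_interior = n_basis - k - 1  # 2 interior knots for 6 bases
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.r_[np.zeros(k + 1), interior, np.ones(k + 1)]
    # Identity coefficients make BSpline evaluate every basis function at once.
    return BSpline(knots, np.eye(n_basis), k)(x)  # shape (len(x), n_basis)


rng = np.random.default_rng(0)
states = rng.normal(size=(100, 2))  # toy batch: 100 transitions, 2-dim state

# Scale each state dimension onto [0, 1], then expand in the spline basis.
scaled = (states - states.min(axis=0)) / (states.max(axis=0) - states.min(axis=0))
Phi = np.hstack([cubic_bspline_design(scaled[:, j]) for j in range(scaled.shape[1])])

# Rule-of-thumb bandwidth h_l = c * sigma_l * n**(-1/3) with c = 7.5.
n, c = scaled.shape[0], 7.5
bandwidths = c * scaled.std(axis=0) * n ** (-1 / 3)

# Bound the (here simulated) missing propensity below at 0.01 before inverting.
propensity = np.clip(rng.uniform(0.0, 1.0, n), 0.01, None)
inverse_weights = 1.0 / propensity
```

With 6 bases per dimension and a 2-dimensional state, the stacked design has 12 columns; the paper's fixed L = 36 corresponds to the same construction over more dimensions or a tensor-product basis.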