Off-Policy Evaluation with Out-of-Sample Guarantees

Authors: Sofia Ek, Dave Zachariah, Fredrik D. Johansson, Petre Stoica

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In the experiments below, we evaluate policies using the limit curves (α, ℓα). We quantify how increasing the credibility of our model assumption, i.e., by increasing Γ, affects the informativeness of the limit curve using (7). We also consider the coverage probability of the curves: Miscoverage gap = α − Pπ{Ln+1 > ℓα(D)}. (17) When this gap is positive the limit is conservative, and when the gap is negative the limit is invalid, at level α. A natural benchmark for the proposed limit (16) in this problem setting is the estimated quantile ℓα(D) = inf{ℓ : bFIPW(ℓ; D) ≥ 1 − α}, (18) using the inverse propensity weighted cdf-estimator (10). In all examples below, the limit (16) is computed using sample splits of equal size, i.e., n0 = n/2. An additional experiment using data from the Infant Health and Development Program (IHDP) can be found in Appendix A.2. The code used for the experiments is made available here https://github.com/sofiaek/off-policy-evaluation. 5.1 Synthetic data In the first example, we consider synthetic data in order to evaluate the coverage of the derived limit curves. We use a simulation setting similar to Jin et al. (2023). The miscoverage gap (17) is estimated by Monte Carlo simulation using 1000 runs, each drawing 1000 new independent samples (Xn+1, Un+1, An+1, Ln+1).
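The benchmark quantile (18) can be sketched in a few lines: it is the smallest loss value at which the inverse-propensity-weighted empirical cdf reaches 1 − α. A minimal sketch follows, assuming a self-normalized IPW cdf estimator as a stand-in for the paper's estimator (10); the losses and weights below are synthetic illustrations, not the paper's data.

```python
import numpy as np

def ipw_quantile_limit(losses, weights, alpha):
    # Benchmark limit (18): smallest loss ℓ such that the weighted
    # empirical cdf F_IPW(ℓ) reaches 1 - alpha.
    order = np.argsort(losses)
    losses, weights = losses[order], weights[order]
    cdf = np.cumsum(weights) / np.sum(weights)        # self-normalized IPW cdf
    idx = np.searchsorted(cdf, 1.0 - alpha, side="left")
    return losses[min(idx, len(losses) - 1)]

rng = np.random.default_rng(0)
L = rng.exponential(size=1000)               # losses L_i (illustrative)
w = rng.uniform(0.5, 2.0, size=1000)         # IPW weights pi(A|X)/bp(A|X)
print(ipw_quantile_limit(L, w, alpha=0.1))
```

Note that, unlike the proposed limit (16), this plug-in quantile carries no finite-sample coverage guarantee, which is what the miscoverage-gap experiments probe.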
Researcher Affiliation Academia Sofia Ek (EMAIL), Department of Information Technology, Uppsala University; Dave Zachariah (EMAIL), Department of Information Technology, Uppsala University; Fredrik D. Johansson (EMAIL), Department of Computer Science & Engineering, Chalmers University of Technology; Petre Stoica (EMAIL), Department of Information Technology, Uppsala University
Pseudocode Yes Algorithm 1 Limit curve of policy π. Input: Policy pπ(A|X), training data D, model bp(A|X), bound Γ ≥ 1 and sample split size n0. 1: Randomly split D into D0 and D1. 2: for α ∈ {0, . . . , 1} do 3: for β ∈ {0, . . . , α} do 4: Compute wβ using (15). 5: Compute ℓα,β using (14). 6: end for 7: Set ℓα to the smallest ℓα,β above. 8: Store (α, ℓα). 9: end for Output: {(α, ℓα)}
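The double-loop structure of Algorithm 1 can be sketched as below. Since equations (14) and (15) are not reproduced in this report, `compute_w_beta` and `compute_candidate_limit` are hypothetical placeholders that only mimic the call structure; they are NOT the paper's formulas.

```python
import numpy as np

def compute_w_beta(D0, beta, gamma):
    # Placeholder for eq. (15) -- dummy value, not the paper's weight.
    return gamma * (1.0 - beta)

def compute_candidate_limit(D1, w_beta, alpha, beta):
    # Placeholder for eq. (14) -- a plain empirical quantile stands in
    # for the paper's candidate limit; w_beta is ignored by the dummy.
    return np.quantile(D1, 1.0 - (alpha - beta))

def limit_curve(D, gamma, n0, alphas, n_beta=11, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(D))
    D0, D1 = D[perm[:n0]], D[perm[n0:]]              # step 1: random split
    curve = []
    for alpha in alphas:                             # grid over levels alpha
        cands = []
        for beta in np.linspace(0.0, alpha, n_beta): # beta in [0, alpha]
            w = compute_w_beta(D0, beta, gamma)
            cands.append(compute_candidate_limit(D1, w, alpha, beta))
        curve.append((alpha, min(cands)))            # step 7: smallest candidate
    return curve                                     # output {(alpha, l_alpha)}
```

Swapping the two placeholders for implementations of (14)-(15) recovers the algorithm as stated.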
Open Source Code Yes The code used for the experiments is made available here https://github.com/sofiaek/off-policy-evaluation.
Open Datasets Yes In the second example, we use data from the National Health and Nutrition Examination Survey (NHANES) for the years 2013-2014 to illustrate the use of the proposed method. Following Zhao et al. (2019), we study the effect of seafood consumption on blood mercury levels. The Infant Health and Development Program (IHDP) investigated the impact of early childhood interventions on the health of low birth-weight and premature infants (Infant Health and Development Program, 1990). Hill (2011) used this study to assemble a data set of 25 covariates X measuring various aspects of the children and their mothers, such as birth weight, weeks born preterm, head circumference, and age of mother.
Dataset Splits Yes In all examples below, the limit (16) is computed using sample splits of equal size, i.e., n0 = n/2. An additional experiment using data from the Infant Health and Development Program (IHDP) can be found in Appendix A.2. Algorithm 1 Limit curve of policy π. Input: Policy pπ(A|X), training data D, model bp(A|X), bound Γ ≥ 1 and sample split size n0. 1: Randomly split D into D0 and D1. The limit (16) is computed using sample splits where n0 = 0.1n. Another 0.1n samples of the data are randomly used to evaluate the policies in Figure 9.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies No The paper does not list specific software dependencies with version numbers.
Experiment Setup Yes The miscoverage gap (17) is estimated by Monte Carlo simulation using 1000 runs, each drawing 1000 new independent samples (Xn+1, Un+1, An+1, Ln+1). We consider a population of individuals with two-dimensional covariates distributed uniformly. The actions are binary, A ∈ {0, 1}, corresponding to "not treat" and "treat", respectively. We want to evaluate a deterministic target policy, described by pπ(A = 0|X) = 1(X1X2 ≥ τ), (19) for different τ ∈ [0, 1]. That is, all individuals whose covariate product X1X2 falls below τ are treated. Note that τ = 0 corresponds to a "treat none" policy (A ≡ 0 for all X) and τ = 1 corresponds to a "treat all" policy (A ≡ 1 for all X). Below we discuss the resulting losses under this policy using observational data with sample sizes n ∈ {250, 500, 1000}. For the training data, the past policy has selected actions as a Bernoulli process: p(A = 0|X) ≈ bp(A = 0|X) = f(c(X1X2 + 1)), c ∈ [1/2, 2], (20) where f(·) is the sigmoid function. We define the past policy in a manner that enables us to control the divergence from the nominal model bp(A|X) in (20): p(A = 0|X, U) = 1(U ≤ t(X))[1 + Γ0^{-1}(bp(A = 0|X)^{-1} − 1)]^{-1} + 1(U > t(X))[1 + Γ0(bp(A = 0|X)^{-1} − 1)]^{-1}, (21) where the threshold function t(X) is designed empirically to ensure that the resulting median loss of the past policy for A = 1 is maximized. Our design of the past policy can be seen as the worst case among all unknown past policies that diverge by a factor Γ0 in (6). We fix Γ0 = 2 here, but treat it as unknown.
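The data-generating process of (19)-(21) can be sketched as follows, assuming the reconstructed forms above: a covariate-product threshold policy, a sigmoid nominal model, and a past policy whose odds of treatment diverge from the nominal odds by exactly a factor Γ0 depending on the unobserved U. The constant threshold t(X) below is an illustrative assumption (the paper tunes t empirically), as is c = 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def target_policy(X, tau):
    # Deterministic target policy (19): p_pi(A=0|X) = 1(X1*X2 >= tau),
    # so individuals with covariate product below tau are treated (A=1).
    return (X[:, 0] * X[:, 1] >= tau).astype(float)

def nominal_model(X, c=1.0):
    # Nominal past-policy model (20): bp(A=0|X) = f(c(X1*X2 + 1)).
    return sigmoid(c * (X[:, 0] * X[:, 1] + 1.0))

def past_policy(X, U, t, gamma0=2.0, c=1.0):
    # Divergent past policy (21): the nominal odds (bp^-1 - 1) are scaled
    # by 1/gamma0 when U <= t(X) and by gamma0 otherwise.
    p_hat = nominal_model(X, c)
    odds = 1.0 / p_hat - 1.0                 # (1 - bp)/bp
    p_up = 1.0 / (1.0 + odds / gamma0)       # [1 + gamma0^-1 (bp^-1 - 1)]^-1
    p_down = 1.0 / (1.0 + odds * gamma0)     # [1 + gamma0 (bp^-1 - 1)]^-1
    return np.where(U <= t(X), p_up, p_down)

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(size=(n, 2))                 # uniform two-dimensional covariates
U = rng.uniform(size=n)                      # unobserved variable
p0 = past_policy(X, U, t=lambda X: np.full(len(X), 0.5), gamma0=2.0)
A = (rng.uniform(size=n) > p0).astype(int)   # Bernoulli action draws
```

With this construction the odds ratio between the true and nominal policies is exactly Γ0 or 1/Γ0 for every individual, matching the "worst case among all past policies that diverge by a factor Γ0" interpretation.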