Revisiting inference after prediction
Authors: Keshav Motwani, Daniela Witten
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sections 3 and 4, we investigate the empirical consequences of our findings from Section 2. These empirical investigations paint a clear picture: namely, that failure to target the correct parameter has substantial statistical consequences for the proposal of Wang et al. (2020), in the form of hypothesis tests that fail to control the Type 1 error, and confidence intervals that fail to attain the nominal coverage. The proposal of Angelopoulos et al. (2023) does not suffer these consequences, as it targets the correct parameter. We close with a discussion in Section 5. In this paper, we use capitals to represent a random variable and lower case to represent its realization. Vectors of length equal to the number of observations, or matrices whose rows correspond to the observations, are in bold. |
| Researcher Affiliation | Academia | Keshav Motwani, Department of Biostatistics, University of Washington, Seattle, WA; Daniela Witten, Departments of Biostatistics and Statistics, University of Washington, Seattle, WA |
| Pseudocode | Yes | Algorithm 1 (Bootstrap correction of Wang et al. (2020)). The goal is to conduct inference on the association between Y and X. 1. Use (y_lab, f̂(z_lab)) to fit the relationship model Y \| f̂(Z) ∼ K(f̂(Z), φ), yielding φ̂. 2. For b = 1, …, B: 2.1. Sample unlabeled observations with replacement to obtain z̃ᵇ_unlab and x̃ᵇ_unlab. 2.2. Sample outcomes ỹᵇ \| f̂(z̃ᵇ_unlab) from the relationship model K(f̂(z̃ᵇ_unlab), φ̂). 2.3. Use (ỹᵇ, x̃ᵇ_unlab) to fit a regression model for the relationship between Y and X, and record the coefficient estimate β̂ᵇ and model-based standard error ŝᵇ. 3. Compute the point estimate β̂ = median{β̂¹, …, β̂ᴮ}. 4. Compute the nonparametric standard error ŜE(β̂) = SD{β̂¹, …, β̂ᴮ}. 5. Compute the parametric standard error ŜE(β̂) = median{ŝ¹, …, ŝᴮ}. |
| Open Source Code | Yes | Code Availability: Scripts to reproduce the results in this manuscript are available at https://github.com/keshav-motwani/PredictionBasedInference/. Our code is based on the code from Wang et al. (2020); we thank the authors for making it publicly accessible. |
| Open Datasets | No | We consider a simple simulation setting, inspired by the "Simulated Data: Continuous" case section of Wang et al. (2020). They generate three datasets: a training dataset consisting of realizations of (Z, X, Y) used to train a machine learning model f̂(·), a labeled dataset consisting of realizations of (Z, X, Y), and an unlabeled dataset consisting only of realizations of (Z, X); both the labeled and unlabeled datasets are used for inference. They consider predictors Z ∈ ℝ⁴ and response Y ∈ ℝ, and define the covariate X ≡ Z₁. In Wang et al. (2020)'s paper, the training, labeled, and unlabeled datasets each consist of 300 observations. Throughout this section, we keep the training sample size fixed at 300 observations, but vary the size of the labeled and unlabeled datasets. As in Wang et al. (2020), we generate the training, labeled, and unlabeled datasets from the same partially linear additive model Y = β₀ + β₁Z₁ + Σⱼ₌₂⁴ βⱼgⱼ(Zⱼ) + ε. Explanation: The paper describes a simulation study where the authors generate their own datasets (training, labeled, unlabeled) from a partially linear additive model. It does not use or provide access to any external, publicly available datasets. |
| Dataset Splits | Yes | They generate three datasets: a training dataset consisting of realizations of (Z, X, Y) used to train a machine learning model f̂(·), a labeled dataset consisting of realizations of (Z, X, Y), and an unlabeled dataset consisting only of realizations of (Z, X); both the labeled and unlabeled datasets are used for inference. They consider predictors Z ∈ ℝ⁴ and response Y ∈ ℝ, and define the covariate X ≡ Z₁. In Wang et al. (2020)'s paper, the training, labeled, and unlabeled datasets each consist of 300 observations. Throughout this section, we keep the training sample size fixed at 300 observations, but vary the size of the labeled and unlabeled datasets. As in Wang et al. (2020), we generate the training, labeled, and unlabeled datasets from the same partially linear additive model Y = β₀ + β₁Z₁ + Σⱼ₌₂⁴ βⱼgⱼ(Zⱼ) + ε. In each replicate of the simulation study, we generate a new labeled and unlabeled dataset as described above. We perform a total of 1,000 simulation replicates. We consider two settings: one under the null (β₁ = 0) and one under the alternative (β₁ = 1). Figures 1, 2, 3, and 4 show various sample sizes for n_lab and n_unlab, e.g., "n_lab = 100, n_unlab = 1000" and "n_lab = 0.1 n_unlab". |
| Hardware Specification | No | Explanation: The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing specifications. |
| Software Dependencies | No | We generate 3 training sets and fit a GAM to each training set, to obtain three fitted models f̂₁, f̂₂, f̂₃. Explanation: The paper mentions fitting GAMs but does not specify the software libraries or version numbers used in its experiments. |
| Experiment Setup | Yes | We consider a simple simulation setting, inspired by the "Simulated Data: Continuous" case section of Wang et al. (2020). They generate three datasets: a training dataset consisting of realizations of (Z, X, Y) used to train a machine learning model f̂(·), a labeled dataset consisting of realizations of (Z, X, Y), and an unlabeled dataset consisting only of realizations of (Z, X); both the labeled and unlabeled datasets are used for inference. They consider predictors Z ∈ ℝ⁴ and response Y ∈ ℝ, and define the covariate X ≡ Z₁. In Wang et al. (2020)'s paper, the training, labeled, and unlabeled datasets each consist of 300 observations. Throughout this section, we keep the training sample size fixed at 300 observations, but vary the size of the labeled and unlabeled datasets. As in Wang et al. (2020), we generate the training, labeled, and unlabeled datasets from the same partially linear additive model Y = β₀ + β₁Z₁ + Σⱼ₌₂⁴ βⱼgⱼ(Zⱼ) + ε. We consider two settings: one under the null (β₁ = 0) and one under the alternative (β₁ = 1). We generate 3 training sets and fit a GAM to each training set, to obtain three fitted models f̂₁, f̂₂, f̂₃. In each replicate of the simulation study, we generate a new labeled and unlabeled dataset as described above. We perform a total of 1,000 simulation replicates. |
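The Algorithm 1 pseudocode quoted in the table above can be sketched in code. The sketch below is a minimal illustration, not the authors' implementation: it assumes a Gaussian linear relationship model for Y given f̂(Z) and simple linear regression of Y on X, whereas the algorithm leaves both model families generic; the function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def wang_bootstrap(y_lab, f_lab, f_unlab, x_unlab, B=100):
    """Sketch of the bootstrap correction of Wang et al. (2020), Algorithm 1.

    Assumes Y | f(Z) ~ N(a + b*f(Z), sigma^2) as the relationship model
    and simple linear regression of Y on X as the downstream model.
    """
    # Step 1: fit the relationship model on the labeled data.
    A = np.column_stack([np.ones_like(f_lab), f_lab])
    phi, *_ = np.linalg.lstsq(A, y_lab, rcond=None)
    sigma = (y_lab - A @ phi).std(ddof=2)

    n = len(f_unlab)
    betas, ses = [], []
    for _ in range(B):
        # Step 2.1: resample the unlabeled observations with replacement.
        idx = rng.integers(0, n, size=n)
        fb, xb = f_unlab[idx], x_unlab[idx]
        # Step 2.2: simulate outcomes from the fitted relationship model.
        yb = phi[0] + phi[1] * fb + rng.normal(0, sigma, size=n)
        # Step 2.3: regress simulated outcomes on X; record the slope
        # estimate and its model-based standard error.
        Xb = np.column_stack([np.ones(n), xb])
        coef, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
        resid = yb - Xb @ coef
        s2 = (resid @ resid) / (n - 2)
        se = np.sqrt(s2 * np.linalg.inv(Xb.T @ Xb)[1, 1])
        betas.append(coef[1])
        ses.append(se)

    betas, ses = np.array(betas), np.array(ses)
    # Steps 3-5: median point estimate, nonparametric SE (SD across
    # bootstrap replicates), and parametric SE (median model-based SE).
    return np.median(betas), betas.std(ddof=1), np.median(ses)
```

As the paper's Sections 3 and 4 argue, simulating outcomes from the fitted relationship model in step 2.2 is exactly what causes the procedure to target the wrong parameter; the sketch is useful only for seeing the mechanics being critiqued.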
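The simulation setup quoted above (a partially linear additive model, with a training set fixed at 300 observations and varying labeled/unlabeled sizes) can be sketched as follows. The nonlinear functions g_j, the coefficient values, and the noise distribution below are illustrative placeholders, since the excerpts do not specify the ones used in the paper.

```python
import numpy as np

def simulate_dataset(n, beta1, rng):
    """Draw n observations from a partially linear additive model
    Y = b0 + b1*Z1 + sum_{j=2}^{4} b_j g_j(Z_j) + eps, with X = Z1.

    The g_j and coefficients here are hypothetical stand-ins.
    """
    Z = rng.normal(size=(n, 4))
    X = Z[:, 0]                       # covariate of interest, X = Z1
    g = [np.sin, np.square, np.tanh]  # placeholder g_2, g_3, g_4
    nonlinear = sum(g[j](Z[:, j + 1]) for j in range(3))
    Y = 1.0 + beta1 * X + nonlinear + rng.normal(size=n)
    return Z, X, Y

# One simulation replicate under the null (beta1 = 0): training set of
# 300 observations, with example labeled/unlabeled sizes from Figure 1.
rng = np.random.default_rng(1)
Z_tr, X_tr, Y_tr = simulate_dataset(300, beta1=0.0, rng=rng)
Z_lab, X_lab, Y_lab = simulate_dataset(100, beta1=0.0, rng=rng)
Z_unlab, X_unlab, _ = simulate_dataset(1000, beta1=0.0, rng=rng)  # Y discarded
```

In the paper's design, each of the 1,000 replicates redraws the labeled and unlabeled datasets while the three GAMs fitted on the training sets stay fixed; repeating the last two calls inside a loop reproduces that structure.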