reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Correlation-assisted Missing Data Estimator

Authors: Timothy I. Cannings, Yingying Fan

JMLR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We also include practical demonstrations throughout the paper using simulated data and the Terneuzen birth cohort and Brandsma datasets available from CRAN.
Researcher Affiliation	Academia	Timothy I. Cannings EMAIL School of Mathematics University of Edinburgh Edinburgh, UK. Yingying Fan EMAIL Department of Data Sciences and Operations Marshall School of Business University of Southern California Los Angeles, CA 90089, USA
Pseudocode	No	The paper describes methods mathematically and provides examples, but no explicit pseudocode blocks or algorithm sections are found.
Open Source Code	No	The paper mentions using existing R packages like `mice`, `ks`, and `regpro` which are available on CRAN. However, it does not state that their own implementation code for the methodologies described in the paper is released or available.
Open Datasets	Yes	We also include practical demonstrations throughout the paper using simulated data and the Terneuzen birth cohort and Brandsma datasets available from CRAN.
Dataset Splits	Yes	In order to evaluate the performance of the CAM estimator, we take a subsample of size 1000 from the complete-cases to use as a test set (this is fixed throughout). We carry out 100 experiments. In each one, we form a training set by taking another sample of size 200 from the remaining 2464 complete-cases (this sample is different in each experiment). The 200 chosen complete-cases are then combined with the observations in Am1, Am2 and Am3 (which are the same in every experiment). Thus, in each experiment, we have n0 = 200, nm1 = 302, nm2 = 182, and nm3 = 108.
Hardware Specification	No	The paper discusses computational cost in general terms, but does not mention specific hardware (e.g., GPU/CPU models, processors, or memory) used for running the experiments.
Software Dependencies	No	The kernel density estimators are computed using the ks package available from CRAN. In the regression settings, we make use of the regpro package available from CRAN. Our implementation utilises the mice R package available from CRAN (van Buuren et al., 2018). While these packages are mentioned, specific version numbers for them or for R itself are not provided.
Experiment Setup	Yes	In each case, we generate a training set of size n ∈ {200, 500}, and then introduce missingness by removing first component of X independently with probability p1 ∈ {0.25, 0.5, 0.75}. The kernel density estimators are computed using the ks package available from CRAN. In particular, we use the kde function with a Gaussian kernel, and the diagonal bandwidth matrices were chosen using the Hpi.diag function. For the Brandsma dataset, we use a Gaussian kernel and the bandwidth was chosen using leave-one-out cross-validation.
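
The Dataset Splits and Experiment Setup rows describe a resampling protocol that may be easier to follow as code. The sketch below is an illustration only, not the authors' implementation: the paper works in R, while this uses Python's standard library, and the function names (`make_splits`, `remove_first_component`), the `seed` argument, and the use of `None` to mark a missing value are all assumptions made for this example. It shows a fixed test set drawn once from the complete cases, a fresh training sample of complete cases drawn in each experiment and combined with the same incomplete observations, and the simulated-data step of removing the first component independently with probability p1.

```python
import random


def make_splits(complete_cases, incomplete_groups,
                n_test=1000, n_train=200, n_experiments=100, seed=0):
    """Return (test_set, experiments) following the protocol quoted above.

    complete_cases: list of fully observed rows.
    incomplete_groups: list of lists of partially observed rows
        (standing in for Am1, Am2, Am3); reused unchanged in every experiment.
    """
    rng = random.Random(seed)  # hypothetical seed, not given in the source
    indices = list(range(len(complete_cases)))

    # The test set is drawn once and stays fixed throughout.
    test_idx = set(rng.sample(indices, n_test))
    test_set = [complete_cases[i] for i in sorted(test_idx)]
    remaining = [i for i in indices if i not in test_idx]

    experiments = []
    for _ in range(n_experiments):
        # A different sample of complete cases in each experiment.
        train = [complete_cases[i] for i in rng.sample(remaining, n_train)]
        # The incomplete observations are the same in every experiment.
        for group in incomplete_groups:
            train.extend(group)
        experiments.append(train)
    return test_set, experiments


def remove_first_component(rows, p1, rng):
    """Mark the first component of each row as missing (None),
    independently with probability p1."""
    return [
        (None,) + tuple(row[1:]) if rng.random() < p1 else tuple(row)
        for row in rows
    ]
```

With 3464 placeholder complete cases and incomplete groups of sizes 302, 182 and 108, each of the 100 training sets produced by `make_splits` would contain 200 + 302 + 182 + 108 = 792 rows, matching the counts n0, nm1, nm2 and nm3 quoted in the table.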