Nonparametric Copula Models for Multivariate, Mixed, and Missing Data
Authors: Joseph Feldman, Daniel R. Kowal
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution. |
| Researcher Affiliation | Academia | Joseph Feldman, Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA; Daniel R. Kowal, Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853-2601, USA, and Department of Statistics, Rice University, Houston, TX 77251-1892, USA |
| Pseudocode | Yes | Algorithm 1: Bayesian RL Gaussian Copula Gibbs Sampler with Missing Data; Algorithm 2: The Margin Adjustment: Sampling under the Bayesian RL Gaussian Copula; Algorithm 3: Bayesian RL Gaussian Copula Imputation using the Margin Adjustment; Algorithm 4: Simulation of a posterior predictive data set of size n under the GMC-MA |
| Open Source Code | Yes | An R package implementing the proposed approach is available on the authors' GitHub page at https://github.com/jfeldman396/GMCImpute |
| Open Datasets | Yes | Our motivating example comes from a collection of variables (see Table 1) in the National Health and Nutrition Examination Survey (NHANES). |
| Dataset Splits | No | The paper describes generating '100 hybrid synthetic data sets' for simulation and applying a missingness mechanism to 'all but 300 observations', but it does not specify standard training/test/validation splits for machine learning reproduction. It discusses 'multiple imputations' (m=20) rather than dataset partitioning. |
| Hardware Specification | Yes | All experiments were run locally on a 2023 MacBook Pro with 32 GB of memory. |
| Software Dependencies | No | The paper mentions software packages such as the 'sbgcop' and 'mice' packages in R, but it does not provide version numbers for these dependencies, which would be required for full reproducibility. |
| Experiment Setup | Yes | For all of our studies, we use a default value of 0.7p, where p is the dimension of the augmented data matrix under the RPL. Next, we modify the scaling constant, δ, from the normal-inverse Wishart prior specified for the cluster-specific components from the mixture model on ηi. ... Though we recommend a default value of δ = 10, we find that, generally, decreasing δ has the effect of increasing the number of clusters discovered. As such, we use δ = 5 in the second simulation study... The total run time for this process was just over an hour. For our simulation examples in Section 6, there were generally many fewer variables. As such, run times were generally much quicker: between 1 and 3 minutes with p = 3 and n ∈ {500, 1000, 2000} in the first exercise, and around 5 minutes when increasing p in the second example. All experiments were run locally on a 2023 MacBook Pro with 32 GB of memory. ... We ran the Gibbs sampler in Appendix C.2 for 20,000 iterations, with the first 5,000 discarded as burn-in and the imputations computed every 50th sample to achieve m = 20. |
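The imputation schedule quoted in the Experiment Setup row (20,000 Gibbs iterations, 5,000 discarded as burn-in, an imputed data set retained every 50th draw until m = 20) can be sketched as below. This is a generic thinning illustration under those stated settings, not the authors' GMCImpute code; the function name and its interpretation of "every 50th sample" are assumptions.

```python
def imputation_iterations(n_iter=20_000, burn_in=5_000, thin=50, m=20):
    """Return the Gibbs iteration indices at which imputations are stored:
    every `thin`-th draw after `burn_in`, stopping once `m` are collected."""
    kept = range(burn_in + thin, n_iter + 1, thin)  # post-burn-in thinned draws
    return list(kept)[:m]                           # keep the first m of them

iters = imputation_iterations()
# Under these settings, the first imputation falls at iteration 5,050
# and the 20th at iteration 6,000.
```

Reading the schedule this way, the m = 20 imputations are well separated (50 iterations apart) while using only a small prefix of the 15,000 post-burn-in draws.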