Nonparametric Copula Models for Multivariate, Mixed, and Missing Data
Authors: Joseph Feldman, Daniel R. Kowal
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution. |
| Researcher Affiliation | Academia | Joseph Feldman, Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA; Daniel R. Kowal, Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853-2601, USA, and Department of Statistics, Rice University, Houston, TX 77251-1892, USA |
| Pseudocode | Yes | Algorithm 1: Bayesian RL Gaussian Copula Gibbs Sampler with Missing Data; Algorithm 2: The Margin Adjustment: Sampling under the Bayesian RL Gaussian Copula; Algorithm 3: Bayesian RL Gaussian Copula Imputation using the Margin Adjustment; Algorithm 4: Simulation of a posterior predictive data set of size n under the GMC-MA |
| Open Source Code | Yes | An R package implementing the proposed approach is available on the authors' GitHub page at https://github.com/jfeldman396/GMCImpute |
| Open Datasets | Yes | Our motivating example comes from a collection of variables (see Table 1) in the National Health and Nutrition Examination Survey (NHANES). |
| Dataset Splits | No | The paper describes generating '100 hybrid synthetic data sets' for simulation and applying a missingness mechanism to 'all but 300 observations', but it does not specify standard training/test/validation splits for machine learning reproduction. It discusses 'multiple imputations' (m=20) rather than dataset partitioning. |
| Hardware Specification | Yes | All experiments were run locally on a 2023 MacBook Pro with 32 GB of memory. |
| Software Dependencies | No | The paper mentions software packages such as the 'sbgcop' and 'mice' packages in R, but it does not provide version numbers for these dependencies, which would be required for full reproducibility. |
| Experiment Setup | Yes | For all of our studies, we use a default value of 0.7p, where p is the dimension of the augmented data matrix under the RPL. Next, we modify the scaling constant, δ, from the normal-inverse Wishart prior specified for the cluster-specific components from the mixture model on ηi. ... Though we recommend a default value of δ = 10, we find that, generally, decreasing δ has the effect of increasing the number of clusters discovered. As such, we use δ = 5 in the second simulation study... The total run time for this process was just over an hour. For our simulation examples in Section 6, there were generally many fewer variables. As such, run times were generally much quicker: between 1 and 3 minutes with p = 3 and n ∈ {500, 1000, 2000} in the first exercise, and around 5 minutes when increasing p in the second example. All experiments were run locally on a 2023 MacBook Pro with 32 GB of memory. ... We ran the Gibbs sampler in Appendix C.2 for 20,000 iterations, with the first 5,000 discarded as burn-in and the imputations computed every 50th sample to achieve m = 20. |
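The imputation schedule quoted in the Experiment Setup row (20,000 Gibbs iterations, 5,000 discarded as burn-in, an imputed data set retained every 50th draw until m = 20) can be sketched as below. This is a generic thinning illustration under those stated settings, not the authors' GMCImpute code; the function name and its interpretation of "every 50th sample" are assumptions.

```python
def imputation_iterations(n_iter=20_000, burn_in=5_000, thin=50, m=20):
    """Return the Gibbs iteration indices at which imputations are stored:
    every `thin`-th draw after `burn_in`, stopping once `m` are collected."""
    kept = range(burn_in + thin, n_iter + 1, thin)  # post-burn-in thinned draws
    return list(kept)[:m]                           # keep the first m of them

iters = imputation_iterations()
# Under these settings, the first imputation falls at iteration 5,050
# and the 20th at iteration 6,000.
```

Reading the schedule this way, the m = 20 imputations are well separated (50 iterations apart) while using only a small prefix of the 15,000 post-burn-in draws.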