Causal Discovery with Unobserved Confounding and Non-Gaussian Data

Authors: Y. Samuel Wang, Mathias Drton

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the effectiveness of our procedure in simulations and an application to an ecology data set.
Researcher Affiliation | Academia | Y. Samuel Wang, EMAIL, Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853, USA; Mathias Drton, EMAIL, Department of Mathematics & Munich Data Science Institute, Technical University of Munich, 85748 Garching bei München, Germany
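Pseudocode | Yes | Algorithm 1 BANG procedure
1: Input: Data $Y \in \mathbb{R}^{p \times n}$ and $S \in \mathbb{R}^{p \times p}$, which is the (potentially sample) covariance of $Y$
2: For all $v \in V$, set $\widehat{\mathrm{pa}}(v) = \emptyset$ and $\widehat{\mathrm{sib}}(v) = V \setminus \{v\}$
3: Set all elements of $D \in \mathbb{R}^{p \times p}$ to be 0 and $l = 1$
4: while $\max_v |\widehat{\mathrm{sib}}(v)| \geq l$ do
5:   for $v \in V$ do
6:     Prune $\widehat{\mathrm{sib}}(v)$ using Algorithm 2
7:     Certify pseudo-parents of $v$ and update $\widehat{\mathrm{pa}}(v)$, $\widehat{\mathrm{sib}}(v)$, and $D$ using Algorithm 3
8:   end for
9:   if $D$ was updated, reset $l = 1$; else set $l = l + 1$
10: end while
11: Remove ancestors which are not parents from $\widehat{\mathrm{pa}}(v)$ for all $v \in V$ using Algorithm 4
12: Return: $\hat{E}_{\to} = \{(u, v) : u \in \widehat{\mathrm{pa}}(v)\}$, $\hat{E}_{\leftrightarrow} = \{\{u, v\} : u \in \widehat{\mathrm{sib}}(v)\}$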
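The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function name `bang_outer_loop` and the three callables `prune`, `certify`, and `remove_ancestors` are hypothetical placeholders for Algorithms 2, 3, and 4, which are not reproduced here. The sketch assumes the loop continues while some node has at least `l` candidate siblings, and it uses a boolean flag in place of checking whether the matrix D was updated.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, Set, Tuple

Node = Hashable
NodeSets = Dict[Node, Set[Node]]


def bang_outer_loop(
    nodes: Iterable[Node],
    prune: Callable[[Node, NodeSets, NodeSets, int], None],      # stand-in for Algorithm 2
    certify: Callable[[Node, NodeSets, NodeSets, int], bool],    # stand-in for Algorithm 3
    remove_ancestors: Callable[[NodeSets], None],                # stand-in for Algorithm 4
) -> Tuple[Set[Tuple[Node, Node]], Set[FrozenSet[Node]]]:
    """Outer loop of Algorithm 1 with the statistical steps left abstract."""
    nodes = list(nodes)
    # Step 2: no parents yet, every other node is a candidate sibling.
    pa: NodeSets = {v: set() for v in nodes}
    sib: NodeSets = {v: set(nodes) - {v} for v in nodes}
    l = 1  # Step 3 (the matrix D is replaced by the `updated` flag below)

    # Step 4: assumed condition max_v |sib(v)| >= l.
    while max((len(sib[v]) for v in nodes), default=0) >= l:
        updated = False
        for v in nodes:
            prune(v, pa, sib, l)            # Step 6: may shrink sib[v] in place
            if certify(v, pa, sib, l):      # Step 7: True if anything was certified
                updated = True
        # Step 9: restart at l = 1 after an update, otherwise increase l.
        l = 1 if updated else l + 1

    remove_ancestors(pa)                    # Step 11
    directed = {(u, v) for v in nodes for u in pa[v]}
    bidirected = {frozenset((u, v)) for v in nodes for u in sib[v]}
    return directed, bidirected
```

The loop terminates only if `certify` eventually stops reporting updates, which in the sketch is the caller's responsibility. With no-op callables the function returns no directed edges and a bidirected edge between every pair of nodes, which matches the initialization in step 2.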
Open Source Code | Yes | Available at https://github.com/ysamwang/ngBap
Open Datasets | Yes | Grace et al. (2016) use a structural equation model to examine the relationships between land productivity and the richness of plant diversity. They consider measurements taken at 1126 plots which are locations across 39 different sites.
Dataset Splits | No | The paper mentions generating synthetic data with varying sample sizes (e.g., "We let n = 500, 1000, 1500" in Section 6.1, and "We let n = 2500, 5000, 7500, 10000, 25000, 50000" in Section 6.2) and the number of replications ("200 replications" or "50 replications"). For the real-world ecology data, it mentions "measurements taken at 1126 plots". However, it does not provide specific train/test/validation splits for any of the datasets, either synthetic or real.
Hardware Specification | No | The acknowledgments section mentions computational resources, stating: "This research was supported in part through the computational resources and staff contributions provided for the Mercury high performance computing cluster at The University of Chicago Booth School of Business which is supported by the Office of the Dean." This names a computing cluster but gives no specific hardware details, such as CPU/GPU models or memory, for the machines used to run the experiments.
Software Dependencies | No | The paper mentions several software implementations and packages used for comparison: "For ParcelLiNGAM we use the Matlab implementation available from the author's website; for RCD we use the lingam python package; for FCI+, we use the R package pcalg (Kalisch et al., 2012); and for GBS we use the R package greedyBAPs (Nowzohour, 2017)." However, it does not provide specific version numbers for Matlab or the Python/R packages, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | For FCI+, RCD, and BANG we set the nominal level of each hypothesis test performed to α = .05, .01, .001. For GBS, we allow 100 random restarts, the same number used in the simulations by Nowzohour (2017). For BANG with EL, we set K = 3 for the gamma and lognormal errors (since they are skewed) and let K = 4 for the uniform and $T_{13}$ (since they are symmetric).
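The settings quoted above could be recorded as a small configuration when re-running the comparison. The sketch below is only illustrative: the constant names, the dictionary keys, and the helper `hypothesis_test_runs` are hypothetical and do not come from the paper or its repository. Only the numeric values are taken from the quoted text.

```python
from itertools import product

# Nominal test levels applied to each test-based method (values from the quoted setup).
ALPHAS = (0.05, 0.01, 0.001)
TEST_BASED_METHODS = ("FCI+", "RCD", "BANG")

# Number of random restarts allowed for GBS.
GBS_RANDOM_RESTARTS = 100

# K used for BANG with EL, keyed by error distribution (key names are hypothetical).
BANG_EL_K = {
    "gamma": 3,      # skewed
    "lognormal": 3,  # skewed
    "uniform": 4,    # symmetric
    "t13": 4,        # symmetric
}


def hypothesis_test_runs():
    """Enumerate every (method, alpha) pair for the test-based methods."""
    return [
        {"method": method, "alpha": alpha}
        for method, alpha in product(TEST_BASED_METHODS, ALPHAS)
    ]
```

Calling `hypothesis_test_runs()` yields nine method and level combinations, one per pairing of the three methods with the three nominal levels.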