Learning Gaussian DAGs from Network Data

Authors: Hangjian Li, Oscar Hernan Madrid Padilla, Qing Zhou

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive numerical experiments also demonstrate that, by jointly estimating the DAG structure and the sample correlation, our method achieves much higher accuracy in structure learning. When the node ordering is unknown, through experiments on synthetic and real data, we show that our algorithm can be used to estimate the correlations between samples, with which we can de-correlate the dependent data to significantly improve the performance of classical DAG learning methods."
Researcher Affiliation | Academia | Hangjian Li (EMAIL), Oscar Hernan Madrid Padilla (EMAIL), Qing Zhou (EMAIL); Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
Pseudocode | Yes | Algorithm 1: Block coordinate descent (BCD) algorithm
    Input: X, Θ^(0), Ω̂, ρ, A(H), T
    while max{ ‖Θ̂^(t+1) − Θ̂^(t)‖_F, ‖B̂^(t+1) − B̂^(t)‖_F } > ρ and t < T do
        for j = 1, ..., p do
            β̂_j^(t+1) ← Lasso regression (9)
        end
        Θ̂^(t+1) ← graphical Lasso with support restriction (10)
        t ← t + 1
    end
    Output: B̂ ← B̂^(t), Θ̂ ← Θ̂^(t)
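The outer BCD loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-node Lasso regression of Eq. (9) and the support-restricted graphical Lasso of Eq. (10) are passed in as hypothetical callables (`lasso_update`, `glasso_update` are assumptions); only the Frobenius-norm stopping rule and the alternation between the B and Θ blocks follow the algorithm as stated.

```python
import numpy as np

def bcd(X, Theta0, lasso_update, glasso_update, rho=1e-4, T=100):
    """Outer loop of a block coordinate descent scheme:
    alternate per-column Lasso updates of the coefficient
    matrix B with a support-restricted precision-matrix
    update of Theta, stopping when both blocks change by
    at most rho in Frobenius norm, or after T iterations."""
    n, p = X.shape
    B = np.zeros((p, p))        # B^(0): empty DAG coefficients
    Theta = Theta0.copy()       # Theta^(0): initial precision matrix
    for t in range(T):
        B_new = B.copy()
        for j in range(p):
            # placeholder for the Lasso regression in Eq. (9)
            B_new[:, j] = lasso_update(X, Theta, j)
        # placeholder for the restricted graphical Lasso in Eq. (10)
        Theta_new = glasso_update(X, B_new)
        # stopping rule: max Frobenius-norm change across both blocks
        delta = max(np.linalg.norm(Theta_new - Theta),
                    np.linalg.norm(B_new - B))
        B, Theta = B_new, Theta_new
        if delta <= rho:
            break
    return B, Theta
```

The two stubbed subroutines are where a real implementation would call a Lasso solver and a graphical Lasso solver with its support constrained to the edge set of H.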
Open Source Code | No | The paper does not explicitly state that source code is provided or offer a link to a code repository for the methodology described. It refers to third-party R packages but not its own implementation.
Open Datasets | Yes | "The RNA-seq data set used in this section were generated by Chu et al. (2016), and is accessible through the Gene Expression Omnibus (GEO) series accession number GSE75748." "We took four real DAGs from the bnlearn repository (Scutari, 2010): Andes, Hailfinder, Barley, Hepar2, and two real undirected networks from tnet (Opsahl, 2009): facebook (Opsahl and Panzarasa, 2009) and celegans n306 (Watts and Strogatz, 1998)."
Dataset Splits | No | The paper mentions generating test sample matrices for evaluation but does not specify explicit training/validation/test splits with percentages, absolute counts, or references to predefined splits for reproducibility. For example, it says: "in each setting, we generated a test sample matrix Xtest from the true distribution for each of the 10 repeated simulations" (Section 5.1.1) and discusses "test data log-likelihood", but not how data is split for training models.
Hardware Specification | No | The paper does not provide specific hardware details such as CPU or GPU models, memory, or cluster configurations used for running its experiments.
Software Dependencies | No | The paper mentions several R packages such as `rcausal` (Ramsey et al., 2017), `sparsebn` (Aragam et al., 2019b), and `pcalg` (Kalisch et al., 2012) for comparative methods. However, it does not specify version numbers for these packages or for the R environment itself, which is necessary for reproducible software dependencies.
Experiment Setup | Yes | "To apply the BCD algorithm, we need to set values for λ1 and λ2 in (8). Since the support of Θ is restricted to the edge set of H, we simply fixed λ2 to a small value (λ2 = 0.01) in all experiments. For each data set, we computed a solution path from the largest λ1,max, for which we get an empty DAG, to λ1,min = λ1,max/100. The optimal λ1 was then chosen by minimizing the BIC score over the DAGs on the solution path."
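The tuning procedure quoted above can be sketched as follows. This is a hedged illustration under stated assumptions: `fit_dag` and `bic` are hypothetical placeholders for fitting a DAG at a given λ1 and scoring it, and the log-spaced grid size is an assumption (the paper does not state how many λ1 values were used); only the sweep from λ1,max down to λ1,max/100 and the BIC-minimization rule come from the quoted text.

```python
import numpy as np

def select_lambda1(lambda1_max, fit_dag, bic, n_grid=20):
    """Sweep a log-spaced solution path from lambda1_max
    (the value giving an empty DAG) down to lambda1_max / 100,
    fit a DAG at each value, and return the lambda1 that
    minimizes the BIC score, along with that score."""
    path = np.logspace(np.log10(lambda1_max),
                       np.log10(lambda1_max / 100), n_grid)
    scores = [bic(fit_dag(lam)) for lam in path]   # BIC per path point
    best = int(np.argmin(scores))                  # smallest BIC wins
    return path[best], scores[best]
```

In a real run, `fit_dag` would invoke the BCD algorithm with λ2 fixed at 0.01, and `bic` would combine the fitted model's log-likelihood with its parameter-count penalty.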