Learning Gaussian DAGs from Network Data

Authors: Hangjian Li, Oscar Hernan Madrid Padilla, Qing Zhou

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive numerical experiments also demonstrate that, by jointly estimating the DAG structure and the sample correlation, our method achieves much higher accuracy in structure learning. When the node ordering is unknown, through experiments on synthetic and real data, we show that our algorithm can be used to estimate the correlations between samples, with which we can de-correlate the dependent data to significantly improve the performance of classical DAG learning methods."
Researcher Affiliation | Academia | Hangjian Li (EMAIL), Oscar Hernan Madrid Padilla (EMAIL), Qing Zhou (EMAIL); Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
Pseudocode | Yes | Algorithm 1: Block coordinate descent (BCD) algorithm
    Input: X, Θ^(0), Ω̂, ρ, A(H), T
    while max{ ‖Θ̂^(t+1) − Θ̂^(t)‖_F, ‖B̂^(t+1) − B̂^(t)‖_F } > ρ and t < T do
        for j = 1, ..., p do
            β̂_j^(t+1) ← Lasso regression (9)
        end
        Θ̂^(t+1) ← graphical Lasso with support restriction (10)
        t ← t + 1
    end
    Output: B̂ ← B̂^(t), Θ̂ ← Θ̂^(t)
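The outer BCD loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-node Lasso regression of Eq. (9) and the support-restricted graphical Lasso of Eq. (10) are passed in as hypothetical callables (`lasso_update`, `glasso_update` are assumptions); only the Frobenius-norm stopping rule and the alternation between the B and Θ blocks follow the algorithm as stated.

```python
import numpy as np

def bcd(X, Theta0, lasso_update, glasso_update, rho=1e-4, T=100):
    """Outer loop of a block coordinate descent scheme:
    alternate per-column Lasso updates of the coefficient
    matrix B with a support-restricted precision-matrix
    update of Theta, stopping when both blocks change by
    at most rho in Frobenius norm, or after T iterations."""
    n, p = X.shape
    B = np.zeros((p, p))        # B^(0): empty DAG coefficients
    Theta = Theta0.copy()       # Theta^(0): initial precision matrix
    for t in range(T):
        B_new = B.copy()
        for j in range(p):
            # placeholder for the Lasso regression in Eq. (9)
            B_new[:, j] = lasso_update(X, Theta, j)
        # placeholder for the restricted graphical Lasso in Eq. (10)
        Theta_new = glasso_update(X, B_new)
        # stopping rule: max Frobenius-norm change across both blocks
        delta = max(np.linalg.norm(Theta_new - Theta),
                    np.linalg.norm(B_new - B))
        B, Theta = B_new, Theta_new
        if delta <= rho:
            break
    return B, Theta
```

The two stubbed subroutines are where a real implementation would call a Lasso solver and a graphical Lasso solver with its support constrained to the edge set of H.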
Open Source Code | No | The paper does not explicitly state that source code is provided or offer a link to a code repository for the methodology described. It refers to third-party R packages but not its own implementation.
Open Datasets | Yes | "The RNA-seq data set used in this section were generated by Chu et al. (2016), and is accessible through the Gene Expression Omnibus (GEO) series accession number GSE75748." "We took four real DAGs from the bnlearn repository (Scutari, 2010): Andes, Hailfinder, Barley, Hepar2, and two real undirected networks from tnet (Opsahl, 2009): facebook (Opsahl and Panzarasa, 2009) and celegans n306 (Watts and Strogatz, 1998)."
Dataset Splits | No | The paper mentions generating test sample matrices for evaluation but does not specify explicit training/validation/test splits with percentages, absolute counts, or references to predefined splits for reproducibility. For example, it says: "in each setting, we generated a test sample matrix Xtest from the true distribution for each of the 10 repeated simulations" (Section 5.1.1) and discusses "test data log-likelihood", but not how data is split for training models.
Hardware Specification | No | The paper does not provide specific hardware details such as CPU or GPU models, memory, or cluster configurations used for running its experiments.
Software Dependencies | No | The paper mentions several R packages such as `rcausal` (Ramsey et al., 2017), `sparsebn` (Aragam et al., 2019b), and `pcalg` (Kalisch et al., 2012) for comparative methods. However, it does not specify version numbers for these packages or for the R environment itself, which is necessary for reproducible software dependencies.
Experiment Setup | Yes | "To apply the BCD algorithm, we need to set values for λ1 and λ2 in (8). Since the support of Θ is restricted to the edge set of H, we simply fixed λ2 to a small value (λ2 = 0.01) in all experiments. For each data set, we computed a solution path from the largest λ1,max, for which we get an empty DAG, to λ1,min = λ1,max/100. The optimal λ1 was then chosen by minimizing the BIC score over the DAGs on the solution path."
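The tuning procedure quoted above can be sketched as follows. This is a hedged illustration under stated assumptions: `fit_dag` and `bic` are hypothetical placeholders for fitting a DAG at a given λ1 and scoring it, and the log-spaced grid size is an assumption (the paper does not state how many λ1 values were used); only the sweep from λ1,max down to λ1,max/100 and the BIC-minimization rule come from the quoted text.

```python
import numpy as np

def select_lambda1(lambda1_max, fit_dag, bic, n_grid=20):
    """Sweep a log-spaced solution path from lambda1_max
    (the value giving an empty DAG) down to lambda1_max / 100,
    fit a DAG at each value, and return the lambda1 that
    minimizes the BIC score, along with that score."""
    path = np.logspace(np.log10(lambda1_max),
                       np.log10(lambda1_max / 100), n_grid)
    scores = [bic(fit_dag(lam)) for lam in path]   # BIC per path point
    best = int(np.argmin(scores))                  # smallest BIC wins
    return path[best], scores[best]
```

In a real run, `fit_dag` would invoke the BCD algorithm with λ2 fixed at 0.01, and `bic` would combine the fitted model's log-likelihood with its parameter-count penalty.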