Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning
Authors: Ashka Shah, Adela Frances DePavia, Nathaniel C Hudson, Ian Foster, Rick Stevens
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks up to 10^4 variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces. Code is available at https://github.com/shahashka/causal_discovery_via_partitioning. In this section we describe experiments for evaluating Algorithm 2 on synthetic random networks with finite data. |
| Researcher Affiliation | Academia | Ashka Shah (Department of Computer Science, University of Chicago); Adela De Pavia (Committee on Computational and Applied Mathematics, University of Chicago); Nathaniel Hudson (Department of Computer Science, University of Chicago); Ian Foster (Department of Computer Science, University of Chicago); Rick Stevens (Department of Computer Science, University of Chicago) |
| Pseudocode | Yes | Algorithm 1: Screen(G, {H_i}_{i=1}^N); Algorithm 2: causal_discovery(V, X, G); Algorithm 3: Screen_Finite_Data(G, {H_i}_{i=1}^N, X); Algorithm 4: score_and_discard; Algorithm 5: loglikelihood_score(i, j, G, X) |
| Open Source Code | Yes | Code is available at https://github.com/shahashka/causal_discovery_via_partitioning. |
| Open Datasets | No | Ground-truth DAGs, G, are synthetically created using a Barabási-Albert scale-free model (Barabási & Bonabeau, 2003); both graphs have p = 50 nodes each, with one graph constructed with m = 1 edge per node and the second with m = 2 edges per node (edge connections are established via preferential attachment as per the generative model). We use n = 100,000 samples from the joint, multivariate Gaussian distribution (details on the DAG and data generating process are in Appendix F.1). The paper uses synthetic data generated from described models, but does not provide access information for a publicly available or open dataset. |
| Dataset Splits | No | The paper generates synthetic data for experiments and does not specify any training, validation, or test splits. It evaluates performance on the generated networks directly. |
| Hardware Specification | Yes | Table 2: This experiment was run with 2x AMD EPYC 7713 CPU @2GHz with a total of 128 cores and 256 GB of RAM. Table 3: This experiment was run with an AMD Zen 3 (Milan) 7543P CPU @2.8 GHz with 64 cores and 512 GB of RAM. Table 4: This experiment was run with an Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz with 64 cores and 192 GB of RAM. |
| Software Dependencies | No | For GES, PC, and RFCI we use the R implementations in the pcalg library... For disjoint_partition in Algorithm 2 we use greedy modularity (Clauset et al., 2004) from the networkx Python library. The paper mentions software libraries like 'pcalg' (R) and 'networkx' (Python) but does not provide specific version numbers for these, nor for Python or R themselves. |
| Experiment Setup | Yes | Default parameters: We use the following parameters by default unless stated otherwise. The graph topology is constructed by generating two scale-free networks using the Barabási-Albert generative model (Barabási & Bonabeau, 2003); both graphs have p = 50 nodes each, with one graph constructed with m = 1 edge per node and the second with m = 2 edges per node (edge connections are established via preferential attachment as per the generative model). We use n = 100,000 samples from the joint, multivariate Gaussian distribution (details on the DAG and data generating process are in Appendix F.1). The fraction of additional edges in a perfect superstructure G is 10% of the edges in G (all edges in G are undirected). For causal discovery on subsets we set A to PC, GES, RFCI or DAGMA. Finally, for disjoint_partition in Algorithm 2 we use greedy modularity (Clauset et al., 2004) from the networkx Python library. The parameter settings for A and each partitioning algorithm are detailed in Appendix F.2 and F.3. Appendix F.3: For PC and RFCI the significance level α was set to 0.001. For GES, fixedGaps (which controls which edges are added in the greedy search) is set to all the edges not in the superstructure, maxDegree is not limited, and adaptive is set to triples. For DAGMA, we used the Linear Model with L2 loss. During optimization lambda1 was set to 0.02, while all other parameters were set to the default. |
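The default synthetic setup described above (two Barabási-Albert graphs with p = 50, m = 1 and m = 2, sampled as a linear Gaussian SEM) can be sketched as follows. This is a minimal illustration, not the authors' code: the exact edge-weight distribution and noise model live in the paper's Appendix F.1, so the weight range and noise variance below are assumptions, and the reduced sample size is only for quick demonstration.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def barabasi_albert_dag(p, m, rng):
    """Generate a scale-free graph (preferential attachment) and orient
    each edge from the lower- to the higher-indexed node to obtain a DAG."""
    g = nx.barabasi_albert_graph(p, m, seed=int(rng.integers(1 << 31)))
    return nx.DiGraph((min(u, v), max(u, v)) for u, v in g.edges())

def sample_linear_gaussian(dag, n, rng):
    """Draw n samples from a linear SEM with standard Gaussian noise.
    The Uniform(0.5, 2.0) weight range is an assumption for illustration."""
    order = list(nx.topological_sort(dag))
    X = np.zeros((n, dag.number_of_nodes()))
    for j in order:
        X[:, j] = rng.standard_normal(n)
        for pa in dag.predecessors(j):
            X[:, j] += rng.uniform(0.5, 2.0) * X[:, pa]
    return X

dag1 = barabasi_albert_dag(50, 1, rng)  # m = 1: 49 edges
dag2 = barabasi_albert_dag(50, 2, rng)  # m = 2: 96 edges
X = sample_linear_gaussian(dag1, 1000, rng)  # paper uses n = 100,000
```

Orienting edges by node index is one simple way to turn the undirected Barabási-Albert output into a DAG; any fixed node ordering works.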
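The disjoint_partition step is quoted as using greedy modularity (Clauset et al., 2004) from networkx. A minimal sketch of that call, on a stand-in superstructure graph (the real input would be the paper's superstructure G, not this placeholder):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in for the undirected superstructure over p = 50 variables.
G = nx.barabasi_albert_graph(50, 2, seed=1)

# CNM greedy modularity yields a disjoint partition of the node set,
# which Algorithm 2 would then hand to the per-subset discovery routine A.
communities = greedy_modularity_communities(G)
partition = [sorted(c) for c in communities]
```

Because the communities are disjoint and cover every node, each variable lands in exactly one subproblem.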