Sample, estimate, aggregate: A recipe for causal discovery foundation models

Authors: Menghua Wu, Yujia Bao, Regina Barzilay, Tommi Jaakkola

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on biological and synthetic data confirm that this model generalizes well beyond its training set, runs on graphs with hundreds of variables in seconds, and can be easily adapted to different underlying data assumptions.
Researcher Affiliation | Collaboration | Menghua Wu (rmwu{at}mit.edu), Department of Computer Science, Massachusetts Institute of Technology; Yujia Bao (yujia.bao{at}accenture.com), Center for Advanced AI, Accenture; Regina Barzilay (regina{at}csail.mit.edu), Department of Computer Science, Massachusetts Institute of Technology; Tommi S. Jaakkola (tommi{at}csail.mit.edu), Department of Computer Science, Massachusetts Institute of Technology
Pseudocode | Yes | Algorithm 1: Resolve marginal estimates of f ∈ F
1: Input: Data D_G faithful to G
2: Initialize E ← K_N as the complete undirected graph on N nodes.
3: for S ∈ S_{d+2} do
4:   Compute Ê_S = f(D_G[S])
5:   for (i, j) ∉ Ê_S do
6:     Remove (i, j) from E
7:   end for
8: end for
9: for Ê_S ∈ {Ê_S}_{S ∈ S_{d+2}} do
10:   for each v-structure i → j ← k in Ê_S do
11:     if {i, j}, {j, k} ∈ E and {i, k} ∉ E then
12:       Assign orientation i → j ← k in E
13:     end if
14:   end for
15: end for
16: Propagate orientations in E (optional).
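The aggregation step of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes each sampled subset S maps to a dict holding the undirected edges and the v-structures (triples i → j ← k) that the expert algorithm f estimated on the data restricted to S.

```python
from itertools import combinations


def aggregate_marginal_estimates(n, estimates):
    """Merge per-subset estimates into one graph (sketch of Algorithm 1).

    `estimates` maps each subset S (tuple of node indices) to a dict with:
      "edges": set of frozenset({i, j}) undirected edges estimated on S,
      "v_structures": list of (i, j, k) triples meaning i -> j <- k.
    Returns the aggregated skeleton E and a set of oriented arcs (a, b).
    """
    # Start from the complete undirected graph K_N on n nodes.
    E = {frozenset(pair) for pair in combinations(range(n), 2)}

    # Remove every edge within a subset S that f did not estimate there.
    for S, est in estimates.items():
        for i, j in combinations(S, 2):
            if frozenset((i, j)) not in est["edges"]:
                E.discard(frozenset((i, j)))

    # Keep a v-structure i -> j <- k only if it is consistent with E:
    # both arms survive in E and the shielding edge {i, k} does not.
    oriented = set()
    for S, est in estimates.items():
        for i, j, k in est["v_structures"]:
            if (frozenset((i, j)) in E and frozenset((j, k)) in E
                    and frozenset((i, k)) not in E):
                oriented.add((i, j))
                oriented.add((k, j))
    return E, oriented
```

On a three-node collider 0 → 1 ← 2, a single marginal estimate over S = (0, 1, 2) with edges {0, 1} and {1, 2} yields the skeleton {{0, 1}, {1, 2}} and the oriented arcs {(0, 1), (2, 1)}.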
Open Source Code | Yes | Our code is available at https://github.com/rmwu/sea.
Open Datasets | Yes | We pretrained Sea models on 6,480 synthetic datasets... To assess generalization and robustness, we evaluate on unseen in-distribution and out-of-distribution synthetic datasets, as well as two real biological datasets (Sachs et al., 2005; Replogle et al., 2022), using the versions from Wang et al. (2017); Chevalley et al. (2025).
Dataset Splits | Yes | We generated 90 training, 5 validation, and 5 testing datasets for each combination.
Hardware Specification | Yes | The models were trained across 2 NVIDIA RTX A6000 GPUs and 60 CPU cores.
Software Dependencies | No | Observational algorithm implementations were provided by the causal-learn library (Zheng et al., 2024).
Experiment Setup | Yes | Our model was implemented with 4 layers with 8 attention heads and hidden dimension 64. Our model was trained using the AdamW optimizer with a learning rate of 1e-4 (Loshchilov et al., 2017). See B.4 for additional details about hyperparameters.
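The quoted hyperparameters (4 layers, 8 attention heads, hidden dimension 64, AdamW at lr 1e-4) can be sketched as a configuration in PyTorch. This is a hypothetical stand-in built only from those numbers, not the authors' actual architecture:

```python
import torch
from torch import nn

# Hypothetical sketch: a generic transformer encoder matching the
# reported hyperparameters, not the SEA model itself.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64,       # hidden dimension 64
    nhead=8,          # 8 attention heads
    batch_first=True,
)
model = nn.TransformerEncoder(encoder_layer, num_layers=4)  # 4 layers

# AdamW optimizer with learning rate 1e-4, as reported.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```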