THE ROBUSTNESS OF DIFFERENTIABLE CAUSAL DISCOVERY IN MISSPECIFIED SCENARIOS
Authors: Huiyang Yi, Yanyan He, Duxin Chen, Mingyu Kang, He Wang, Wenwu Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit robustness... We conduct extensive large-scale experimental evaluations of twelve prominent causal discovery algorithms across eight pivotal model assumption violation scenarios. Our rigorous research endeavor involves executing over 70,000 experiments on more than 2,400 synthetic datasets, ensuring a comprehensive assessment of the algorithm capabilities. |
| Researcher Affiliation | Academia | 1School of Mathematics, Southeast University, Nanjing 210096, China; 2School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China |
| Pseudocode | No | The paper describes various algorithms (PC, GES, DirectLiNGAM, CAM, SortnRegress, NOTEARS, GOLEM, NOTEARS-MLP, GraN-DAG, NOCURL, DAGMA) in text form in Section B and its subsections, but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper provides links to GitHub repositories for benchmark algorithms (e.g., "We use the implementation of the PC algorithm in causal-learn (Zheng et al., 2024) python package, available at https://github.com/py-why/causal-learn."). However, these links are for third-party tools that the authors used, not for their own implementation of the benchmarking framework or any new methodology described in the paper itself. |
| Open Datasets | Yes | Our rigorous research endeavor involves executing over 70,000 experiments on more than 2,400 synthetic datasets... Following the data generation of Zheng et al. (2018; 2020) and Liu et al. (2023), different datasets are generated for both linear and nonlinear vanilla models. We simulate ER and SF graphs based on the number of nodes d ∈ {10, 20, 50} and average node degree k ∈ {2, 4}. In addition, we consider scenarios with Gaussian Random Partitions (GRP) (Brandes et al., 2003) graphs... We also consider the real-world Sachs (Sachs et al., 2005) dataset (see Appendix I). |
| Dataset Splits | Yes | For each experimental configuration and scenario, 10 datasets of 2000 samples are generated... The proportion of data from e1 is P1 ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, and the proportion from e2 is 1 − P1. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. While it discusses runtime results, it lacks the specific hardware specifications. |
| Software Dependencies | No | The paper mentions several Python packages used for benchmark methods (e.g., "causal-learn (Zheng et al., 2024) python package", "gCastle (Zhang et al., 2021) python package"). However, it only provides a citation for the paper describing each package, not a specific version number of the software library itself, which is required for reproducibility. |
| Experiment Setup | Yes | Thus, to ensure a fair comparison of various methods, we tune λ1 in {0.005, 0.01, 0.05, 0.5, 2, 5} and tune α in {0.001, 0.01, 0.05, 0.1}. |
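The data-generation protocol quoted in the Open Datasets row (ER graphs with d ∈ {10, 20, 50} nodes, average degree k ∈ {2, 4}, 2000 samples per dataset) can be sketched as follows. This is an illustrative reconstruction of the linear-Gaussian recipe of Zheng et al. (2018), not the authors' released code; the function name and the edge-weight range are assumptions.

```python
import numpy as np

def simulate_er_linear_sem(d=10, k=2, n=2000, seed=0):
    """Sample an Erdos-Renyi DAG with expected degree k and draw n
    observations from a linear SEM with standard Gaussian noise.
    Illustrative sketch only; weight range [0.5, 2.0] is an assumption."""
    rng = np.random.default_rng(seed)
    p = k / (d - 1)  # edge probability giving expected node degree k
    # Strictly upper-triangular support guarantees acyclicity; a random
    # permutation then hides the trivial variable ordering.
    mask = np.triu(rng.random((d, d)) < p, 1)
    perm = rng.permutation(d)
    mask = mask[np.ix_(perm, perm)]
    weights = rng.uniform(0.5, 2.0, (d, d)) * rng.choice([-1.0, 1.0], (d, d))
    W = mask * weights
    # Linear SEM X = X W + E  =>  X = E (I - W)^{-1}
    E = rng.standard_normal((n, d))
    X = E @ np.linalg.inv(np.eye(d) - W)
    return W, X
```

Nonlinear variants in the paper replace the linear mechanism with, e.g., MLP or Gaussian-process functions of the parents, but the graph-sampling step is the same.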