When Faithfulness Fails: The Performance Limits of Neural Causal Discovery

Authors: Mateusz Olko, Mateusz Gajewski, Joanna Wojciechowska, Mikołaj Morzy, Piotr Sankowski, Piotr Miłoś

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our systematic evaluation highlights significant room for improvement in the accuracy of the evaluated methods when uncovering causal structures. We identify a fundamental limitation: unavoidable likelihood score estimation errors prevent distinguishing the true structure, even for small graphs and relatively large sample sizes. Furthermore, we identify the faithfulness property as a critical bottleneck: (i) it is likely to be violated across any reasonable dataset size range, and (ii) its violation directly undermines the performance of neural penalized-likelihood discovery methods.
Researcher Affiliation | Collaboration | (1) IDEAS NCBR, Warsaw, Poland; (2) University of Warsaw, Warsaw, Poland; (3) Poznan University of Technology, Poznan, Poland; (4) MIM Solutions, Warsaw, Poland; (5) Research Institute IDEAS, Warsaw, Poland; (6) Institute of Mathematics, Polish Academy of Sciences, Warsaw, Poland; (7) deepsense.ai, Warsaw, Poland.
Pseudocode | Yes | Pseudocode of the method described in Sec. 3 is provided in Algorithm 1 ("Algorithm 1: Overview of NN-Opt").
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets | No | We generate synthetic data with a known ground-truth causal structure. We consider causal DAGs with only five nodes V = {1, . . . , 5}. We generate these DAGs using the Erdős–Rényi model with an expected number of 5 edges.
Dataset Splits | No | The paper discusses evaluating causal discovery on "subsets of varying sizes" and "datasets with varying number of observational samples, ranging from 20 to 8,000 observations", but does not provide specific training, validation, or test dataset splits in terms of percentages, counts, or predefined files.
Hardware Specification | No | We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/016906.
Software Dependencies | No | The paper mentions several neural causal discovery methods like DCDI, DiBS, BayesDAG, and SDCD, and discusses hyperparameter tuning for them, but does not provide specific version numbers for these or any other software libraries or dependencies used.
Experiment Setup | Yes | To ensure a fair comparison across all methods, we perform systematic hyperparameter tuning, selecting the best-performing parameters for each method. We employ a grid search approach based on the parameter ranges suggested by the original authors. This process optimizes key variables such as regularization coefficients, sparsity controls, and kernel configurations. Details can be found in Appendix E.2. DCDI grid search: ...; selected: regularization coefficient = 1, learning rate = 0.001, Augmented Lagrangian tolerance = 10^-8.
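The synthetic setup described in the Open Datasets row (five-node DAGs drawn from an Erdős–Rényi model with 5 expected edges) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function name and sampling details are assumptions:

```python
import numpy as np

def sample_er_dag(n_nodes=5, expected_edges=5, rng=None):
    """Sample a DAG from an Erdos-Renyi-style model.

    Edges are drawn independently along a random node ordering, so the
    result is acyclic by construction. The edge probability is chosen so
    that the expected number of edges matches `expected_edges`.
    """
    rng = np.random.default_rng(rng)
    n_pairs = n_nodes * (n_nodes - 1) // 2      # C(n, 2) possible edges
    p = expected_edges / n_pairs
    order = rng.permutation(n_nodes)            # random topological order
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:
                adj[order[i], order[j]] = 1     # edge follows the order -> no cycles
    return adj

adj = sample_er_dag()
```

With 5 nodes there are 10 candidate edges, so an expected edge count of 5 corresponds to edge probability 0.5.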
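The Dataset Splits row notes evaluation on subsets of varying sizes, from 20 to 8,000 observations. One simple way to build such evaluation subsets is sketched below; the intermediate sizes and the nesting scheme are assumptions, since the paper states only the overall range:

```python
import numpy as np

def nested_subsets(data, sizes=(20, 50, 200, 1000, 8000), seed=0):
    """Return nested prefixes of a shuffled dataset, one per requested size.

    Nesting keeps each smaller evaluation set contained in the larger ones,
    so performance curves reflect sample size rather than resampling noise.
    """
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    return {n: shuffled[:n] for n in sizes if n <= len(data)}
```

Sizes larger than the available data are silently skipped, so the same call works for datasets of any length.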
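The exhaustive sweep described in the Experiment Setup row can be sketched generically as below. Here `train_fn`, `score_fn`, and the candidate grid values are hypothetical placeholders, not the paper's actual search space:

```python
import itertools

def grid_search(train_fn, score_fn, grid):
    """Exhaustive grid search: train on every combination, keep the best score.

    `grid` maps hyperparameter names to lists of candidate values, mirroring
    a sweep over e.g. regularization coefficients and learning rates.
    """
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(train_fn(**params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For instance, a grid like `{"reg_coeff": [0.1, 1, 10], "lr": [0.001, 0.01]}` (values hypothetical) would train six models and return the combination with the highest validation score.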