reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Toward Falsifying Causal Graphs Using a Permutation-Based Test

Authors: Elias Eulig, Atalanti A. Mastakouri, Patrick Blöbaum, Michaela Hardt, Dominik Janzing

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluating on both simulated and real data sets from various domains, including biology and cloud monitoring, we demonstrate that the true graph is not falsified by our metric, whereas the wrong graphs given by a hypothetical user are likely to be falsified.
Researcher Affiliation	Collaboration	1German Cancer Research Center (DKFZ) 2Heidelberg University 3Amazon Research Tübingen 4University Hospital Tübingen EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	No	The paper describes a permutation test and related concepts in text, but does not include a clearly labeled pseudocode block or algorithm section.
Open Source Code	Yes	An implementation of our metric is available in the Python package DoWhy (Blöbaum et al. 2024). Project https://eeulig.github.io/dag-falsification
Open Datasets	Yes	Protein Signaling Network (Sachs et al. 2005) This open dataset contains quantitative measurements... Auto MPG (Quinlan 1993) The Auto MPG dataset contains eight attributes...
Dataset Splits	No	The paper mentions using N=10^3 observations for synthetic data, and specific N values for real-world datasets (e.g., N=853, N=398, N=432), but does not specify any training, validation, or test splits for these datasets.
Hardware Specification	No	The paper includes a runtime table (Table 3) for different graph sizes but does not specify the CPU, GPU, or any other hardware components used for these measurements.
Software Dependencies	No	The paper mentions software like the 'Python package DoWhy' and 'The R package dagitty' and algorithms like 'GCM with boosted decision trees', 'LiNGAM', 'CAM', and 'NOTEARS', but does not provide specific version numbers for any of these software components or libraries.
Experiment Setup	Yes	For all experiments on synthetic data we sample T = 10^3 node permutations and use datasets with N = 10^3 observations. To investigate the effect of N and T on p_LMC we run ablation studies on nonlinear data with N, T ∈ {10^1, 10^2, 10^3, 10^4}. For α = 5% we reject the hypotheses that the graphs are as bad as random ones.
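
The Experiment Setup row refers to sampling T node permutations and rejecting at a 5% level. The sketch below illustrates that general idea only; it is not the authors' implementation and does not use the DoWhy API. It assumes the p-value is the fraction of node-permuted graphs that show at most as many violations as the given graph. The names `permute_nodes`, `permutation_p_value`, `toy_violations`, and `SUPPORTED` are hypothetical, and `toy_violations` is a placeholder for a real score that would count violated conditional independences estimated from data.

```python
import random


def permute_nodes(edges, nodes, rng):
    """Relabel the nodes of a directed graph with a random permutation."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return {(relabel[u], relabel[v]) for u, v in edges}


def permutation_p_value(edges, nodes, count_violations, T=1000, seed=0):
    """Fraction of T node-permuted graphs with at most as many
    violations as the given graph (smaller means the given graph
    is better than random relabelings)."""
    rng = random.Random(seed)
    observed = count_violations(edges)
    as_good = sum(
        count_violations(permute_nodes(edges, nodes, rng)) <= observed
        for _ in range(T)
    )
    return as_good / T


# Hypothetical stand-in for a data-driven score: edges outside this
# set are counted as violations. A real score would be computed from
# conditional independence tests on the N observations.
SUPPORTED = {("a", "b"), ("b", "c"), ("c", "d")}


def toy_violations(edges):
    return len(set(edges) - SUPPORTED)


nodes = ["a", "b", "c", "d"]
given_graph = {("a", "b"), ("b", "c"), ("c", "d")}

p_value = permutation_p_value(given_graph, nodes, toy_violations, T=1000)
alpha = 0.05  # significance level, as in the Experiment Setup row
print(f"p = {p_value:.3f}; rejected at alpha = {alpha}: {p_value < alpha}")
```

In this toy case only the identity relabeling reproduces the given chain, so the p-value estimates 1/24 (about 0.042) and fluctuates with the random seed and with T, which is the kind of dependence the paper's ablation over N and T examines.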