reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Falsification of Unconfoundedness by Testing Independence of Causal Mechanisms

Authors: Rickard Karlsson, J. H. Krijthe

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To showcase the practical relevance of our approach, we show that our method is able to efficiently detect confounding on both simulated and semi-synthetic data. 6. Experiments: We conducted a series of experiments to compare the proposed MINT algorithm with alternative baseline approaches. First, we investigate efficiency with respect to number of samples and number of environments. Next, we validate our theoretical findings by investigating necessary mechanism changes that allow for falsification. We then assessed the sensitivity of our algorithm to (mis)specification in its working models. Lastly, we evaluated all methods under more realistic conditions using semi-synthetic data based on the real-world Twins dataset (Almond et al., 2005), which includes birth data across different geographical locations used as environment labels.
Researcher Affiliation	Academia	Department of Intelligent Systems, Delft University of Technology, the Netherlands. Correspondence to: Rickard Karlsson <EMAIL>.
Pseudocode	Yes	5. Algorithm: We now introduce the Mechanism INdependent Test (MINT) algorithm, which operationalizes our falsification strategy for testing mechanism independence using data from multiple environments. We will use the following notation: for all environments $s = 1, \dots, K$, we denote the observed data matrices as $A_s = [A_1, \dots, A_{n_s}]^\top$, $Y_s = [Y_1, \dots, Y_{n_s}]^\top$, $\tilde{\Psi}_s = [\tilde{\psi}(X_1), \dots, \tilde{\psi}(X_{n_s})]^\top$, and $\tilde{\Phi}_s = [\tilde{\phi}(X_1, A_1), \dots, \tilde{\phi}(X_{n_s}, A_{n_s})]^\top$. The MINT algorithm can be divided into two stages. In the first stage, for all $s = 1, \dots, K$, we estimate the parameters $(\omega_s, \gamma_s)$. The estimates are obtained by solving the least-squares problems $\hat{\omega}_s = \arg\min_{\omega} \|A_s - \tilde{\Psi}_s \omega\|_2^2$ and $\hat{\gamma}_s = \arg\min_{\gamma} \|Y_s - \tilde{\Phi}_s \gamma\|_2^2$, where $\|\cdot\|_2^2$ denotes the squared $\ell_2$-norm. We denote all estimated parameters as $\hat{\omega} = [\hat{\omega}_1, \dots, \hat{\omega}_K]$ and $\hat{\gamma} = [\hat{\gamma}_1, \dots, \hat{\gamma}_K]$. In the second stage, we perform a statistical independence test for the null hypothesis $H_0 : P(\omega, \gamma) = P(\omega)P(\gamma)$ using the estimated parameters $\hat{\omega}$ and $\hat{\gamma}$.
Open Source Code	Yes	The code for reproducing our experiments is available at our GitHub repository: https://github.com/RickardKarl/falsification-unconfoundedness
Open Datasets	Yes	In the final experiment, we used data from twin births in the USA between 1989-1991 (Almond et al., 2005) to construct a multi-environment observational dataset with a known causal structure.
Dataset Splits	No	The paper describes how synthetic and semi-synthetic data were generated and used in experiments, including parameters like 'N samples per environment' and 'K environments'. However, it does not specify explicit training, validation, or test dataset splits in the conventional sense for model training and evaluation.
Hardware Specification	No	The paper mentions that research was "facilitated by the computational resources and support of the Delft AI Cluster (DAIC) at TU Delft." However, it does not provide specific details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies	No	The paper mentions using specific software such as the 'Pearson partial correlation test', the 'non-parametric kernel conditional independence test (KCIT)', and states that they adopted 'the KCIT implementation from the causallearn Python package (Zheng et al., 2024)'. However, it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup	Yes	We measured performance using the falsification rate (probability of falsification) and set the significance level $\alpha = 0.05$ to control Type 1 errors. We used $K = 100$ environments with $N = 50$ samples per environment and $d = 1$ observed confounder, and set the polynomial degree to $p = 2$. The noise variables $\varepsilon_A$ and $\varepsilon_Y$ were mean-zero Normal distributed with their standard deviation set to 0.5.
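
The two-stage procedure quoted in the Pseudocode row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (see the repository linked above for that). Several parts are assumptions made only for this sketch: the data-generating process in `simulate_envs`, the choice of outcome features (polynomial terms in X plus a linear term in A), and the second-stage test, which here is a simple permutation test on the largest absolute Pearson correlation between estimated coefficients rather than the partial correlation or KCIT tests mentioned in the paper. The function names are hypothetical. The defaults K = 100, N = 50, p = 2, and noise standard deviation 0.5 follow the Experiment Setup row.

```python
import numpy as np

def simulate_envs(K=100, N=50, confounded=True, seed=0):
    """Hypothetical multi-environment data: (X, A, Y) per environment.
    U is unobserved; its mean shifts across environments."""
    rng = np.random.default_rng(seed)
    envs = []
    for _ in range(K):
        a, b = rng.normal(1.0, 0.5, size=2)   # independently drawn mechanism parameters
        mu = rng.normal(0.0, 1.0)             # environment-specific mean of U
        X = rng.normal(size=N)
        U = rng.normal(mu, 1.0, size=N)
        A = a * X + U + rng.normal(0.0, 0.5, size=N)
        u_effect = 2.0 * U if confounded else 0.0   # U affects Y only when confounded
        Y = b * X + A + u_effect + rng.normal(0.0, 0.5, size=N)
        envs.append((X, A, Y))
    return envs

def mint_stage_one(envs, degree=2):
    """Stage 1: per-environment least squares for the treatment and outcome models."""
    omegas, gammas = [], []
    for X, A, Y in envs:
        psi = np.vander(X, degree + 1, increasing=True)   # [1, X, ..., X^degree]
        phi = np.column_stack([psi, A])                   # assumed outcome features
        omega, *_ = np.linalg.lstsq(psi, A, rcond=None)
        gamma, *_ = np.linalg.lstsq(phi, Y, rcond=None)
        omegas.append(omega)
        gammas.append(gamma)
    return np.array(omegas), np.array(gammas)

def mint_stage_two(omegas, gammas, n_perm=500, seed=0):
    """Stage 2 (stand-in test): permutation p-value for independence of the
    estimated parameters across environments."""
    rng = np.random.default_rng(seed)
    k = omegas.shape[1]

    def max_abs_corr(w, g):
        corr = np.corrcoef(np.hstack([w, g]), rowvar=False)
        return np.max(np.abs(corr[:k, k:]))   # cross-block correlations only

    observed = max_abs_corr(omegas, gammas)
    null = [max_abs_corr(omegas, gammas[rng.permutation(len(gammas))])
            for _ in range(n_perm)]
    exceed = sum(s >= observed for s in null)
    return float((1 + exceed) / (1 + n_perm))

for flag in (True, False):
    om, ga = mint_stage_one(simulate_envs(confounded=flag, seed=1))
    print("confounded" if flag else "unconfounded", "p-value:", mint_stage_two(om, ga))
```

Unconfoundedness would be flagged as falsified when the p-value falls below the chosen significance level (0.05 in the quoted setup). Repeating this over many simulated datasets and averaging the rejections gives an estimate of the falsification rate.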