From Dependency to Causality: A Machine Learning Approach

Authors: Gianluca Bontempi, Maxime Flauder

JMLR 2015

Reproducibility variables (Variable: Result, followed by the supporting LLM response):
Research Type: Experimental. "Experimental results based on synthetic and published data show that the D2C approach is competitive and often outperforms state-of-the-art methods."
Researcher Affiliation: Academia. "Gianluca Bontempi EMAIL Maxime Flauder EMAIL Machine Learning Group, Computer Science Department, Interuniversity Institute of Bioinformatics in Brussels (IB)², ULB, Université Libre de Bruxelles, Brussels, Belgium"
Pseudocode: Yes. Evidence (Section 3, The D2C Algorithm): The rationale of the D2C algorithm is to predict the existence of a causal link between two variables in a multivariate setting by (i) creating a set of features of the relationship between the members of the Markov blankets of the two variables and (ii) using a classifier (e.g., a Random Forest, as in our experiments) to learn a mapping between the features and the presence of a causal link. Two sets of features summarize the relation between the two Markov blankets: the first accounts for the presence (or the position, if the MB is obtained by ranking) of the terms of M_j in M_i and vice versa. For instance, if z_i is a cause of z_j we expect to find z_i highly ranked among the causal terms of M_j, but z_j absent (or ranked low) among the causes of M_i. The second set of features is based on the results of the previous section and is obtained by summarizing the distributions of the asymmetric descriptors with a set of quantiles. The proposed algorithm (D2C), for each pair of measured variables z_i and z_j:

1. infers from data the two Markov blankets M_i and M_j (e.g., by using state-of-the-art approaches) and the subsets M_i \ z_j = {m_i^(k_i), k_i = 1, ..., K_i} and M_j \ z_i = {m_j^(k_j), k_j = 1, ..., K_j}. Most existing algorithms associate to the Markov blanket a ranking in which the most strongly relevant variables come first.
2. computes a set of (conditional) mutual information terms describing the dependency between z_i and z_j:
   I = [I(z_i; z_j), I(z_i; z_j | M_j \ z_i), I(z_i; z_j | M_i \ z_j)]   (18)
3. computes the positions P_i^(k_i) of the members m_i^(k_i) of M_i \ z_j in the ranking associated to M_j \ z_i, and the positions P_j^(k_j) of the terms m_j^(k_j) in the ranking associated to M_i \ z_j. If a term of M_i is absent from M_j, its position is set to K_j + 1 (respectively K_i + 1).
4. computes the populations based on the asymmetric descriptors introduced in Section 2.3:
   (a) D1(i, j) = {I(z_i; m_j^(k_j) | z_j), k_j = 1, ..., K_j}
   (b) D1(j, i) = {I(z_j; m_i^(k_i) | z_i), k_i = 1, ..., K_i}
   (c) D2(i, j) = {I(m_i^(k_i); m_j^(k_j) | z_j), k_i = 1, ..., K_i, k_j = 1, ..., K_j}
   (d) D2(j, i) = {I(m_j^(k_j); m_i^(k_i) | z_i), k_i = 1, ..., K_i, k_j = 1, ..., K_j}
   (e) D3(i, j) = {I(z_i; m_j^(k_j)), k_j = 1, ..., K_j}
   (f) D3(j, i) = {I(z_j; m_i^(k_i)), k_i = 1, ..., K_i}
5. creates a vector of descriptors
   x = [I, Q(P̂_i), Q(P̂_j), Q(D̂1(i, j)), Q(D̂1(j, i)), Q(D̂2(i, j)), Q(D̂2(j, i)), Q(D̂3(i, j)), Q(D̂3(j, i))]   (19)
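Steps 1–5 above can be sketched in code. The following is an illustrative Python sketch, not the authors' R implementation: `mi_fn` and `cmi_fn` are hypothetical placeholders for (conditional) mutual information estimators, and conditioning on a whole blanket set in Eq. (18) is crudely approximated term by term.

```python
import numpy as np

def quantile_summary(values, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize an empirical distribution by sample quantiles (the paper's Q)."""
    v = np.asarray(list(values), dtype=float)
    if v.size == 0:                      # empty population: pad with zeros
        return [0.0] * len(qs)
    return np.quantile(v, qs).tolist()

def d2c_descriptor(i, j, mb, mi_fn, cmi_fn):
    """Assemble a D2C-style descriptor vector x for an ordered pair (z_i, z_j).

    mb     : dict mapping a node to its (ranked) Markov blanket list
    mi_fn  : mi_fn(a, b)     -> estimate of I(z_a; z_b)       (placeholder)
    cmi_fn : cmi_fn(a, b, c) -> estimate of I(z_a; z_b | z_c) (placeholder)
    """
    Mi = [m for m in mb[i] if m != j]     # M_i \ z_j, ranked
    Mj = [m for m in mb[j] if m != i]     # M_j \ z_i, ranked
    Ki, Kj = len(Mi), len(Mj)

    # Step 2, Eq. (18): set-conditioning approximated by averaging
    # single-variable conditional MI terms (a simplification).
    I_terms = [mi_fn(i, j),
               np.mean([cmi_fn(i, j, m) for m in Mj]) if Mj else mi_fn(i, j),
               np.mean([cmi_fn(i, j, m) for m in Mi]) if Mi else mi_fn(i, j)]

    # Step 3: positions in the other blanket's ranking, K+1 if absent.
    Pi = [Mj.index(m) + 1 if m in Mj else Kj + 1 for m in Mi]
    Pj = [Mi.index(m) + 1 if m in Mi else Ki + 1 for m in Mj]

    # Step 4: populations of asymmetric descriptors.
    D1_ij = [cmi_fn(i, m, j) for m in Mj]
    D1_ji = [cmi_fn(j, m, i) for m in Mi]
    D2_ij = [cmi_fn(mi, mj, j) for mi in Mi for mj in Mj]
    D2_ji = [cmi_fn(mj, mi, i) for mi in Mi for mj in Mj]
    D3_ij = [mi_fn(i, m) for m in Mj]
    D3_ji = [mi_fn(j, m) for m in Mi]

    # Step 5: concatenate MI terms and quantile summaries into x.
    x = list(I_terms)
    for pop in (Pi, Pj, D1_ij, D1_ji, D2_ij, D2_ji, D3_ij, D3_ji):
        x.extend(quantile_summary(pop))
    return x
```

With the five quantile levels this yields 3 + 8 × 5 = 43 descriptors per ordered pair, which the classifier then maps to the presence or absence of a causal link.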
Open Source Code: Yes. "The D2C code is available in the CRAN R package D2C (Bontempi et al., 2014). URL: http://CRAN.R-project.org/package=D2C. R package version 1.1."
Open Datasets: Yes. "The second part of the assessment relies on the simulated and resimulated data sets proposed in Table 11 of Aliferis et al. (2010)."
Dataset Splits: Yes. "The largest D2C training set is made of D = 60000 pairs (x_d, y_d) and is obtained by generating DAGs and storing, for each of them, the descriptors associated to at most 4 positive examples (i.e., pairs where the node z_i is a direct cause of z_j) and at most 6 negative examples (i.e., pairs where z_i is not a direct cause of z_j). A Random Forest classifier is trained on the balanced data set: we use the implementation from the R package randomForest (Liaw and Wiener, 2002) with default settings. The test set is obtained by considering a number of independently simulated DAGs: 190 DAGs for the small and medium configurations and 90 for the large configuration. For each testing DAG we select 4 positive and 6 negative examples."
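The per-DAG selection of positive and negative pairs described above can be sketched as follows. This is a minimal Python sketch, assuming the DAG is given as an edge list; the function and argument names are illustrative, not taken from the paper's code.

```python
import random

def training_pairs(dag_edges, n_nodes, n_pos=4, n_neg=6, rng=None):
    """Select up to n_pos direct-cause pairs (label 1) and up to n_neg
    non-cause pairs (label 0) from one simulated DAG, mirroring the
    paper's per-DAG quota of 4 positive and 6 negative examples."""
    rng = rng or random.Random(0)
    edges = set(dag_edges)                 # (i, j) means z_i is a direct cause of z_j
    all_pairs = [(i, j) for i in range(n_nodes)
                 for j in range(n_nodes) if i != j]
    pos = [p for p in all_pairs if p in edges]
    neg = [p for p in all_pairs if p not in edges]
    rng.shuffle(pos)
    rng.shuffle(neg)
    return ([(p, 1) for p in pos[:n_pos]] +
            [(p, 0) for p in neg[:n_neg]])
```

Repeating this over many generated DAGs and stacking the resulting labeled descriptor vectors produces the balanced training set on which the Random Forest is fit.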
Hardware Specification: No. The paper does not provide specific hardware details, such as CPU/GPU models or memory sizes, used for running the experiments.
Software Dependencies: Yes. "The D2C code is available in the CRAN R package D2C (Bontempi et al., 2014). URL: http://CRAN.R-project.org/package=D2C. R package version 1.1."
Experiment Setup: Yes. "A Random Forest classifier is trained on the balanced data set: we use the implementation from the R package randomForest (Liaw and Wiener, 2002) with default settings. The largest D2C training set is made of D = 60000 pairs (x_d, y_d), obtained by generating DAGs and storing, for each of them, the descriptors associated to at most 4 positive examples (pairs where z_i is a direct cause of z_j) and at most 6 negative examples (pairs where z_i is not a direct cause of z_j). For each testing DAG we select 4 positive and 6 negative examples. Q returns a set of sample quantiles of a distribution (in the experiments the quantiles are set to 0.1, 0.25, 0.5, 0.75, 0.9)."
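The Q operator mentioned above is straightforward to reproduce. A small sketch using NumPy's default linear-interpolation sample quantiles, which may differ slightly from the quantile type used by the authors' R code:

```python
import numpy as np

# Quantile levels stated in the paper's experiments.
Q_LEVELS = (0.1, 0.25, 0.5, 0.75, 0.9)

def Q(population):
    """Summarize a descriptor population by its sample quantiles."""
    return np.quantile(np.asarray(population, dtype=float), Q_LEVELS).tolist()

# Toy population of mutual-information values:
summary = Q([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
```

Each descriptor population, whatever its size, is thus reduced to a fixed-length five-number summary, which keeps the final feature vector x the same length for every pair of variables.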