From Dependency to Causality: A Machine Learning Approach

Authors: Gianluca Bontempi, Maxime Flauder

JMLR 2015

Reproducibility variables (Variable: Result, followed by the supporting LLM response):
Research Type: Experimental. "Experimental results based on synthetic and published data show that the D2C approach is competitive and often outperforms state-of-the-art methods."
Researcher Affiliation: Academia. "Gianluca Bontempi EMAIL Maxime Flauder EMAIL Machine Learning Group, Computer Science Department, Interuniversity Institute of Bioinformatics in Brussels (IB)², ULB, Université Libre de Bruxelles, Brussels, Belgium"
Pseudocode: Yes. Evidence (Section 3, The D2C Algorithm): The rationale of the D2C algorithm is to predict the existence of a causal link between two variables in a multivariate setting by (i) creating a set of features of the relationship between the members of the Markov blankets of the two variables and (ii) using a classifier (e.g., a Random Forest, as in our experiments) to learn a mapping between the features and the presence of a causal link. Two sets of features summarize the relation between the two Markov blankets: the first accounts for the presence (or the position, if the MB is obtained by ranking) of the terms of M_j in M_i and vice versa. For instance, if z_i is a cause of z_j we expect to find z_i highly ranked among the causal terms of M_j, but z_j absent (or ranked low) among the causes of M_i. The second set of features is based on the results of the previous section and is obtained by summarizing the distributions of the asymmetric descriptors with a set of quantiles. The proposed algorithm (D2C), for each pair of measured variables z_i and z_j:

1. infers from data the two Markov blankets M_i and M_j (e.g., by using state-of-the-art approaches) and the subsets M_i \ z_j = {m_i^(k_i), k_i = 1, ..., K_i} and M_j \ z_i = {m_j^(k_j), k_j = 1, ..., K_j}. Most existing algorithms associate to the Markov blanket a ranking in which the most strongly relevant variables come first.
2. computes a set of (conditional) mutual information terms describing the dependency between z_i and z_j:
   I = [I(z_i; z_j), I(z_i; z_j | M_j \ z_i), I(z_i; z_j | M_i \ z_j)]   (18)
3. computes the positions P_i^(k_i) of the members m_i^(k_i) of M_i \ z_j in the ranking associated to M_j \ z_i, and the positions P_j^(k_j) of the terms m_j^(k_j) in the ranking associated to M_i \ z_j. If a term of M_i is absent from M_j, its position is set to K_j + 1 (respectively K_i + 1).
4. computes the populations based on the asymmetric descriptors introduced in Section 2.3:
   (a) D1(i, j) = {I(z_i; m_j^(k_j) | z_j), k_j = 1, ..., K_j}
   (b) D1(j, i) = {I(z_j; m_i^(k_i) | z_i), k_i = 1, ..., K_i}
   (c) D2(i, j) = {I(m_i^(k_i); m_j^(k_j) | z_j), k_i = 1, ..., K_i, k_j = 1, ..., K_j}
   (d) D2(j, i) = {I(m_j^(k_j); m_i^(k_i) | z_i), k_i = 1, ..., K_i, k_j = 1, ..., K_j}
   (e) D3(i, j) = {I(z_i; m_j^(k_j)), k_j = 1, ..., K_j}
   (f) D3(j, i) = {I(z_j; m_i^(k_i)), k_i = 1, ..., K_i}
5. creates a vector of descriptors
   x = [I, Q(P̂_i), Q(P̂_j), Q(D̂1(i, j)), Q(D̂1(j, i)), Q(D̂2(i, j)), Q(D̂2(j, i)), Q(D̂3(i, j)), Q(D̂3(j, i))]   (19)
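Steps 1–5 above can be sketched in code. The following is an illustrative Python sketch, not the authors' R implementation: `mi_fn` and `cmi_fn` are hypothetical placeholders for (conditional) mutual information estimators, and conditioning on a whole blanket set in Eq. (18) is crudely approximated term by term.

```python
import numpy as np

def quantile_summary(values, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize an empirical distribution by sample quantiles (the paper's Q)."""
    v = np.asarray(list(values), dtype=float)
    if v.size == 0:                      # empty population: pad with zeros
        return [0.0] * len(qs)
    return np.quantile(v, qs).tolist()

def d2c_descriptor(i, j, mb, mi_fn, cmi_fn):
    """Assemble a D2C-style descriptor vector x for an ordered pair (z_i, z_j).

    mb     : dict mapping a node to its (ranked) Markov blanket list
    mi_fn  : mi_fn(a, b)     -> estimate of I(z_a; z_b)       (placeholder)
    cmi_fn : cmi_fn(a, b, c) -> estimate of I(z_a; z_b | z_c) (placeholder)
    """
    Mi = [m for m in mb[i] if m != j]     # M_i \ z_j, ranked
    Mj = [m for m in mb[j] if m != i]     # M_j \ z_i, ranked
    Ki, Kj = len(Mi), len(Mj)

    # Step 2, Eq. (18): set-conditioning approximated by averaging
    # single-variable conditional MI terms (a simplification).
    I_terms = [mi_fn(i, j),
               np.mean([cmi_fn(i, j, m) for m in Mj]) if Mj else mi_fn(i, j),
               np.mean([cmi_fn(i, j, m) for m in Mi]) if Mi else mi_fn(i, j)]

    # Step 3: positions in the other blanket's ranking, K+1 if absent.
    Pi = [Mj.index(m) + 1 if m in Mj else Kj + 1 for m in Mi]
    Pj = [Mi.index(m) + 1 if m in Mi else Ki + 1 for m in Mj]

    # Step 4: populations of asymmetric descriptors.
    D1_ij = [cmi_fn(i, m, j) for m in Mj]
    D1_ji = [cmi_fn(j, m, i) for m in Mi]
    D2_ij = [cmi_fn(mi, mj, j) for mi in Mi for mj in Mj]
    D2_ji = [cmi_fn(mj, mi, i) for mi in Mi for mj in Mj]
    D3_ij = [mi_fn(i, m) for m in Mj]
    D3_ji = [mi_fn(j, m) for m in Mi]

    # Step 5: concatenate MI terms and quantile summaries into x.
    x = list(I_terms)
    for pop in (Pi, Pj, D1_ij, D1_ji, D2_ij, D2_ji, D3_ij, D3_ji):
        x.extend(quantile_summary(pop))
    return x
```

With the five quantile levels this yields 3 + 8 × 5 = 43 descriptors per ordered pair, which the classifier then maps to the presence or absence of a causal link.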
Open Source Code: Yes. "The D2C code is available in the CRAN R package D2C (Bontempi et al., 2014). URL: http://CRAN.R-project.org/package=D2C. R package version 1.1."
Open Datasets: Yes. "The second part of the assessment relies on the simulated and resimulated data sets proposed in Table 11 of Aliferis et al. (2010)."
Dataset Splits: Yes. "The largest D2C training set is made of D = 60000 pairs (x_d, y_d) and is obtained by generating DAGs and storing, for each of them, the descriptors associated to at most 4 positive examples (i.e., pairs where the node z_i is a direct cause of z_j) and at most 6 negative examples (i.e., pairs where z_i is not a direct cause of z_j). A Random Forest classifier is trained on the balanced data set: we use the implementation from the R package randomForest (Liaw and Wiener, 2002) with default settings. The test set is obtained by considering a number of independently simulated DAGs: 190 DAGs for the small and medium configurations and 90 for the large configuration. For each testing DAG we select 4 positive and 6 negative examples."
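The per-DAG selection of positive and negative pairs described above can be sketched as follows. This is a minimal Python sketch, assuming the DAG is given as an edge list; the function and argument names are illustrative, not taken from the paper's code.

```python
import random

def training_pairs(dag_edges, n_nodes, n_pos=4, n_neg=6, rng=None):
    """Select up to n_pos direct-cause pairs (label 1) and up to n_neg
    non-cause pairs (label 0) from one simulated DAG, mirroring the
    paper's per-DAG quota of 4 positive and 6 negative examples."""
    rng = rng or random.Random(0)
    edges = set(dag_edges)                 # (i, j) means z_i is a direct cause of z_j
    all_pairs = [(i, j) for i in range(n_nodes)
                 for j in range(n_nodes) if i != j]
    pos = [p for p in all_pairs if p in edges]
    neg = [p for p in all_pairs if p not in edges]
    rng.shuffle(pos)
    rng.shuffle(neg)
    return ([(p, 1) for p in pos[:n_pos]] +
            [(p, 0) for p in neg[:n_neg]])
```

Repeating this over many generated DAGs and stacking the resulting labeled descriptor vectors produces the balanced training set on which the Random Forest is fit.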
Hardware Specification: No. The paper does not provide specific hardware details, such as CPU/GPU models or memory sizes, used for running the experiments.
Software Dependencies: Yes. "The D2C code is available in the CRAN R package D2C (Bontempi et al., 2014). URL: http://CRAN.R-project.org/package=D2C. R package version 1.1."
Experiment Setup: Yes. "A Random Forest classifier is trained on the balanced data set: we use the implementation from the R package randomForest (Liaw and Wiener, 2002) with default settings. The largest D2C training set is made of D = 60000 pairs (x_d, y_d), obtained by generating DAGs and storing, for each of them, the descriptors associated to at most 4 positive examples (pairs where z_i is a direct cause of z_j) and at most 6 negative examples (pairs where z_i is not a direct cause of z_j). For each testing DAG we select 4 positive and 6 negative examples. Q returns a set of sample quantiles of a distribution (in the experiments the quantiles are set to 0.1, 0.25, 0.5, 0.75, 0.9)."
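The Q operator mentioned above is straightforward to reproduce. A small sketch using NumPy's default linear-interpolation sample quantiles, which may differ slightly from the quantile type used by the authors' R code:

```python
import numpy as np

# Quantile levels stated in the paper's experiments.
Q_LEVELS = (0.1, 0.25, 0.5, 0.75, 0.9)

def Q(population):
    """Summarize a descriptor population by its sample quantiles."""
    return np.quantile(np.asarray(population, dtype=float), Q_LEVELS).tolist()

# Toy population of mutual-information values:
summary = Q([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
```

Each descriptor population, whatever its size, is thus reduced to a fixed-length five-number summary, which keeps the final feature vector x the same length for every pair of variables.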