Simulating Counterfactuals

Authors: Juha Karvanen, Santtu Tikka, Matti Vihola

JAIR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The critical part of Algorithm 4 is the quality of the sample returned by Algorithm 3. In this section, the performance of Algorithm 3 is studied using randomly generated linear Gaussian SCMs. Importantly, we can analytically derive the true distribution for comparison in this particular scenario, see Online Appendix 2 for the details. In the simulation experiment, we generate linear Gaussian SCMs with random graph structure and random coefficients, apply Algorithm 3, and compare the simulated observations with the true conditional normal distribution. The parameters of the simulation and the performance measures are explained in Online Appendix 3. The key results are summarized in Table 1.
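The comparison described in this row can be illustrated with a minimal sketch. It assumes a hypothetical two-variable linear Gaussian SCM (X = U_X, Y = b * X + U_Y) and uses plain importance weighting as a stand-in sampler; it is not the paper's Algorithm 3, and the function names and parameter values are chosen only for illustration. The point is the workflow: derive the exact conditional normal distribution, simulate, and compare the two.

```python
import math
import random


def true_conditional(b, var_x, var_y, y):
    """Exact mean and variance of X given Y = y in the toy SCM
    X = U_X, Y = b * X + U_Y with U_X ~ N(0, var_x), U_Y ~ N(0, var_y)."""
    var_total = b * b * var_x + var_y   # marginal variance of Y
    mean = b * var_x * y / var_total    # Cov(X, Y) / Var(Y) * y
    var = var_x * var_y / var_total     # Var(X) - Cov(X, Y)^2 / Var(Y)
    return mean, var


def simulated_conditional(b, var_x, var_y, y, n, seed=0):
    """Approximate the same conditional moments by importance weighting
    (an assumption of this sketch, not the method studied in the paper)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, math.sqrt(var_x)) for _ in range(n)]
    # Weight each draw by the (unnormalized) density of U_Y = y - b * x.
    ws = [math.exp(-0.5 * (y - b * x) ** 2 / var_y) for x in xs]
    total = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    return mean, var


if __name__ == "__main__":
    print("exact    :", true_conditional(0.8, 1.0, 0.5, 1.0))
    print("simulated:", simulated_conditional(0.8, 1.0, 0.5, 1.0, n=50_000))
```

With a larger graph the exact distribution comes from the usual conditioning formula for a multivariate normal; the scalar case above is the smallest instance of it.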
Researcher Affiliation | Academia | Juha Karvanen, Santtu Tikka, Matti Vihola; Department of Mathematics and Statistics, University of Jyväskylä, Finland
Pseudocode | Yes | Algorithm 1: An algorithm for simulating n observations from a u-monotonic SCM M = (V, U, F, p(u)) on the condition that the value of a continuous variable C is c. The optional argument D0 is an n-row data matrix containing the values of some variables V0 ⊆ V, U0 ⊆ U that precede C in the topological order and have already been fixed. Algorithm 2: An algorithm for simulating n observations from a causal model M = (V, U, F, p(u)) on the condition that the value of a discrete variable C is c. The optional argument D0 is an n-row data matrix containing the values of some variables V0 ⊆ V, U0 ⊆ U that precede C in the topological order and have already been fixed. Algorithm 3: An algorithm for simulating n observations from an SCM M = (V, U, F, p(u)) that is u-monotonic with respect to all continuous variables in C1, ..., CK, under the conditions C = (C1 = c1) ∧ ⋯ ∧ (CK = cK). The topological order of the variables in the conditions is C1 < C2 < ⋯ < CK. The batch size n = n is used in Algorithm 2. Algorithm 4: An algorithm for simulating n observations from a counterfactual distribution under the conditions C = (C1 = c1) ∧ ⋯ ∧ (CK = cK) in an SCM M = (V, U, F, p(u)) that is u-monotonic with respect to all continuous variables in the set {C1, ..., CK}. The topological order of the variables in the conditions is C1 < C2 < ⋯ < CK. Algorithm 5: An algorithm for evaluating the fairness of a prediction model Ŷ(·) in a u-monotonic SCM M with sensitive variables S and response variables Y. The case to be considered is defined by conditions W and C, where W = (W1 = w1) ∧ ⋯ ∧ (WM = wM) denotes the conditions for Pa(Y) \ S (the non-sensitive observed parents of the responses Y), and C = (C1 = c1) ∧ ⋯ ∧ (CK = cK) denotes the conditions for some other variables (which may include S). The argument n defines the number of simulated counterfactual observations. Algorithm 6: Particle Filter(M0:J, G1:J, n)
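The general idea behind conditioning on a discrete variable and then simulating a counterfactual can be sketched as follows. This is a simplified illustration on a hypothetical three-variable SCM, using batch-wise rejection of background-variable draws followed by an intervention on the retained draws; it is not a transcription of the paper's Algorithms 2 or 4, and every name in it (simulate, sample_u_given, counterfactual, the variables Z, X, Y) is an assumption made for the example.

```python
import random


def draw_u(rng):
    """Draw one realization of the background variables U (toy choice: N(0, 1))."""
    return {"u_z": rng.gauss(0.0, 1.0),
            "u_x": rng.gauss(0.0, 1.0),
            "u_y": rng.gauss(0.0, 1.0)}


def simulate(u, do=None):
    """Evaluate the toy structural equations in topological order.
    Entries of `do` override the corresponding equations (an intervention)."""
    do = do or {}
    v = {}
    v["Z"] = do.get("Z", u["u_z"])
    v["X"] = do.get("X", 1 if v["Z"] + u["u_x"] > 0 else 0)  # discrete variable
    v["Y"] = do.get("Y", 2.0 * v["X"] + v["Z"] + u["u_y"])
    return v


def sample_u_given(condition, n, batch_size=1000, seed=0):
    """Collect n draws of U whose simulated values satisfy the discrete
    condition, generating candidates batch by batch and rejecting the rest."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        for _ in range(batch_size):
            u = draw_u(rng)
            v = simulate(u)
            if all(v[name] == value for name, value in condition.items()):
                kept.append(u)
    return kept[:n]


def counterfactual(condition, intervention, n, seed=0):
    """Return (factual, counterfactual) pairs that share the same U draw."""
    us = sample_u_given(condition, n, seed=seed)
    return [(simulate(u), simulate(u, do=intervention)) for u in us]


if __name__ == "__main__":
    pairs = counterfactual({"X": 1}, {"X": 0}, n=5)
    for factual, cf in pairs:
        print(factual["Y"], cf["Y"])
```

Rejection only works when the condition has positive probability, which is why conditions on continuous variables need a different treatment (the captions above rely on u-monotonicity for that case).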
Open Source Code | Yes | The simulation code is available at https://github.com/JuhaKarvanen/simulating_counterfactuals. ...the R code for the example is available in the repository https://github.com/JuhaKarvanen/simulating_counterfactuals in the file fairness_example.R.
Open Datasets | Yes | We consider variables that are similar to those typically present in credit-scoring datasets, such as Statlog (German Credit Data, Hofmann, 1994).
Dataset Splits | No | To set up the example, we simulated training data from an SCM corresponding to the causal diagram of Figure 1 and fitted prediction models A, B, and C for the default risk using XGBoost (Chen & Guestrin, 2016; Chen et al., 2023). These models are opaque AI models for the fairness evaluator who can only see the probability of default predicted by the models. Algorithm 5 was applied to prediction models A, B, and C in 1000 cases that were again simulated from the same SCM.
Hardware Specification | No | CSC IT Center for Science, Finland, is acknowledged for computational resources.
Software Dependencies | Yes | Algorithms 1–5 are implemented in the R package R6causal (Karvanen, 2024) which contains R6 (Chang, 2021) classes and methods for SCMs. ...fitted prediction models A, B, and C for the default risk using XGBoost (Chen & Guestrin, 2016; Chen et al., 2023).
Experiment Setup | No | To set up the example, we simulated training data from an SCM corresponding to the causal diagram of Figure 1 and fitted prediction models A, B, and C for the default risk using XGBoost (Chen & Guestrin, 2016; Chen et al., 2023). These models are opaque AI models for the fairness evaluator who can only see the probability of default predicted by the models. Algorithm 5 was applied to prediction models A, B, and C in 1000 cases that were again simulated from the same SCM. In the algorithm, Ŷ(·) was one of the prediction models, M was the SCM whose causal diagram is depicted in Figure 1, sensitive variables S were gender and ethnicity, condition C contained all observed values of the case, and number of observations was n = 1000.
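A fairness check of the kind described in this row can be sketched on a toy model. Everything below is hypothetical: the binary SCM (sensitive variable S, non-sensitive parent W, proxy Z that depends on S), the two hand-written prediction functions standing in for fitted models, and the gap measure (difference between the mean predictions under each intervened value of S with W held at its observed value). It is an assumption-laden illustration of evaluating an opaque predictor on simulated counterfactuals, not the paper's Algorithm 5.

```python
import random

SENSITIVE_VALUES = (0, 1)  # possible values of the toy sensitive variable S


def simulate(u, do=None):
    """Toy structural equations; entries of `do` replace the equations."""
    do = do or {}
    v = {}
    v["S"] = do.get("S", 1 if u["u_s"] > 0 else 0)
    v["W"] = do.get("W", 1 if u["u_w"] > 0 else 0)
    v["Z"] = do.get("Z", 1 if v["S"] + u["u_z"] > 0.5 else 0)  # proxy of S
    return v


def sample_u_given(case, n, seed=0):
    """Rejection-sample background variables consistent with the observed case."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        u = {k: rng.gauss(0.0, 1.0) for k in ("u_s", "u_w", "u_z")}
        if simulate(u) == case:
            kept.append(u)
    return kept


def fairness_gap(predict, case, n=1000, seed=0):
    """Spread of the mean prediction across interventions on S, with W fixed
    to its observed value. The gap measure is an assumption of this sketch."""
    us = sample_u_given(case, n, seed)
    means = []
    for s in SENSITIVE_VALUES:
        preds = [predict(simulate(u, do={"S": s, "W": case["W"]})) for u in us]
        means.append(sum(preds) / n)
    return max(means) - min(means)


def model_with_proxy(v):
    """Hypothetical predictor that uses the proxy Z."""
    return 0.2 + 0.3 * v["Z"]


def model_without_proxy(v):
    """Hypothetical predictor that uses only W."""
    return 0.2 + 0.1 * v["W"]


if __name__ == "__main__":
    case = {"S": 1, "W": 1, "Z": 1}
    print("with proxy   :", fairness_gap(model_with_proxy, case))
    print("without proxy:", fairness_gap(model_without_proxy, case))
```

The predictor is only ever called as a function of simulated rows, which mirrors the setting above where the evaluator sees nothing but the predicted probabilities.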