Deep End-to-end Causal Inference

Authors: Tomas Geffner, Javier Antoran, Adam Foster, Wenbo Gong, Chao Ma, Emre Kiciman, Amit Sharma, Angus Lamb, Martin Kukla, Nick Pawlowski, Agrin Hilmkil, Joel Jennings, Meyer Scetbon, Miltiadis Allamanis, Cheng Zhang

TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we conduct extensive experiments (over a thousand) to show the competitive performance of DECI when compared to relevant baselines for both causal discovery and inference, with both synthetic and causal machine learning benchmarks, across data types and levels of missingness.
Researcher Affiliation | Collaboration | 1 University of Massachusetts Amherst, 2 University of Cambridge, 3 Microsoft Research, 4 G-Research
Pseudocode | No | The paper describes methods and optimization procedures in detail (e.g., in Appendix B.1, Optimization Details for Causal Discovery), but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for reproducing experiments is available at https://github.com/microsoft/causica/tree/v0.0.0. The latest version of DECI is at https://github.com/microsoft/causica.
Open Datasets | Yes | For the pseudo-real data we consider the SynTReN generator (Van den Bulcke et al., 2006)...Finally, we include two semi-synthetic causal inference benchmark datasets for ATE evaluation: Twins (twin birth datasets in the US) (Almond et al., 2005) and IHDP (Infant Health and Development Program data) (Hill, 2011). See Appendix D for all experimental details. IHDP...can be downloaded from https://github.com/AMLab-Amsterdam/CEVAE. TWINS...raw dataset is downloaded from https://github.com/AMLab-Amsterdam/CEVAE.
Dataset Splits | Yes | All datasets have n = 5000 training samples. For the pseudo-real data...we take n = 400 for training. Finally, for the real dataset, we use...n = 800 samples. We use a 70%/30% train-test split ratio. Each dataset comes with a training set of 2000 samples... We then sub-sample the training dataset to obtain additional datasets with 1000, 10,000, 100,000, 500,000, and 900,000 points, where each smaller subset is fully contained in each larger one.
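The nested sub-sampling the quote describes (each smaller subset fully contained in every larger one) can be obtained by shuffling the data once and taking prefixes. A minimal sketch, assuming the data fits in a NumPy array; the function name `nested_subsets` is hypothetical, not from the paper's code:

```python
import numpy as np

def nested_subsets(data, sizes, seed=0):
    """Return {size: subset} where smaller subsets are prefixes of larger ones.

    Shuffling once and slicing prefixes guarantees the containment property
    described in the quoted setup.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))  # one fixed shuffle for all sizes
    return {n: data[order[:n]] for n in sorted(sizes)}
```

Because all subsets slice the same permutation, `nested_subsets(data, [1000, 10_000])[1000]` is by construction contained in the 10,000-point subset.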
Hardware Specification | No | The paper does not explicitly mention any specific hardware components (e.g., GPU models, CPU models, or detailed computer specifications) used for running the experiments.
Software Dependencies | No | The paper mentions software like the 'gcastle package (Zhang et al., 2021)' and optimizers like 'Adam (Kingma & Ba, 2014)', but it does not specify version numbers for any libraries, packages, or programming languages.
Experiment Setup | Yes | We use λs = 5 in our prior over graphs eq. (5). For ELBO MC gradients we use the Gumbel softmax method with a hard forward pass and a soft backward pass with temperature of 0.25. The functions eq. (7) used in DECI's SEM, ζ and ℓ, are 2-hidden-layer MLPs with 128 hidden units per hidden layer. For the non-Gaussian noise model in eq. (8), the bijection κ is an 8-bin rational quadratic spline (Durkan et al., 2019) with learnt parameters. ...We sample a graph G ∼ qϕ(G), and a set of exogenous noise variables z ∼ pz. We initialize ρ = 1 and α = 0. At the beginning of step (i) we measure the DAG penalty P1 = E_{qϕ(G)}[h(G)]. Then, we run step (i) as explained above. At the beginning of step (ii) we measure the DAG penalty again, P2 = E_{qϕ(G)}[h(G)]. If P2 < 0.65 P1, we leave ρ unchanged and update α ← α + ρ P2. Otherwise, if P2 ≥ 0.65 P1, we leave α unchanged and update ρ ← 10 ρ. We repeat the sequence (i)-(ii) for a maximum of 100 steps or until convergence (measured as α or ρ reaching some max value, which we set to 10^13 for both), whichever happens first. Step (i): optimizing the objective for some fixed values of ρ and α using Adam (Kingma & Ba, 2014). We optimize the objective for a maximum of 6000 steps or until convergence...We use Adam, initialized with a step-size of 0.01. During training, we reduce the step-size by a factor of 10 if the training loss does not improve for 500 steps.
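The outer-loop schedule quoted above (initialize ρ = 1, α = 0; grow α when the expected DAG penalty shrinks enough, otherwise grow ρ tenfold) can be sketched as plain Python. This is a minimal sketch of the described schedule, not the authors' implementation: `expected_dag_penalty` and `optimize_elbo` are hypothetical stand-ins for the paper's E_{qϕ(G)}[h(G)] estimate and the inner Adam optimization of step (i):

```python
def augmented_lagrangian_schedule(expected_dag_penalty, optimize_elbo,
                                  max_outer_steps=100, max_value=1e13):
    """Sketch of the quoted rho/alpha update rule for the DAG penalty."""
    rho, alpha = 1.0, 0.0
    for _ in range(max_outer_steps):
        p1 = expected_dag_penalty()        # penalty before step (i)
        optimize_elbo(rho, alpha)          # step (i): inner Adam optimization
        p2 = expected_dag_penalty()        # penalty at the start of step (ii)
        if p2 < 0.65 * p1:
            alpha = alpha + rho * p2       # enough progress: grow the multiplier
        else:
            rho = 10.0 * rho               # insufficient progress: grow the penalty weight
        if alpha >= max_value or rho >= max_value:
            break                          # quoted convergence criterion
    return rho, alpha
```

The 0.65 threshold, the tenfold ρ update, and the 10^13 cap are taken directly from the quoted setup; everything else (function signatures, return value) is illustrative.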
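The "hard forward pass, soft backward pass" Gumbel-softmax sampling mentioned in the setup is the standard straight-through estimator. A hedged sketch in PyTorch, assuming a generic categorical `logits` tensor rather than the paper's specific graph parameterization; the function name is hypothetical:

```python
import torch

def gumbel_softmax_st(logits, temperature=0.25):
    """Straight-through Gumbel-softmax: one-hot values, soft gradients."""
    # Sample standard Gumbel noise and form the relaxed (soft) sample.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    soft = torch.softmax((logits + gumbel) / temperature, dim=-1)
    # Hard one-hot sample taken at the argmax of the soft sample.
    hard = torch.zeros_like(soft).scatter_(-1, soft.argmax(dim=-1, keepdim=True), 1.0)
    # Forward pass sees `hard`; the backward pass differentiates through `soft`.
    return hard + soft - soft.detach()
```

Note that `torch.nn.functional.gumbel_softmax(logits, tau=0.25, hard=True)` implements the same estimator; the explicit version above only makes the hard/soft split visible.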