Automated Detection of Causal Inference Opportunities: Regression Discontinuity Subgroup Discovery

Authors: Tony Liu, Patrick Lawlor, Lyle Ungar, Konrad Kording, Rahul Ladhania

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show the utility of our approach through both synthetic experiments (Section 5) and a case study using a medical claims dataset consisting of over 60 million patients (Section 6)."
Researcher Affiliation | Collaboration | Tony Liu, University of Pennsylvania; Roblox
Pseudocode | Yes | "Algorithm 1: RD Subgroup Discovery (RDSGD)"
Open Source Code | Yes | "Source code and the data needed to reproduce all figures are available at: https://github.com/tliu526/rdsgd."
Open Datasets | Yes | "Source code and the data needed to reproduce all figures are available at: https://github.com/tliu526/rdsgd." "We do, however, provide descriptive statistics of the data presented in Table D.1 as well as anonymized datasets that are sufficient to recreate all figures in this paper."
Dataset Splits | Yes | "We split our data into equally sized samples S1, S2 for each clinical context."
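The two-sample split quoted above can be sketched as follows. This is a minimal illustration, not the paper's code: it simply partitions row indices into two equally sized samples S1 and S2, as would be done once per clinical context.

```python
import numpy as np

def split_half(n_rows, seed=0):
    """Randomly partition row indices into two equally sized samples S1, S2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    half = n_rows // 2
    # Truncate to 2 * half so both samples have exactly the same size.
    return idx[:half], idx[half:half * 2]

s1, s2 = split_half(10_000)
print(len(s1), len(s2))  # 5000 5000
```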
Hardware Specification | Yes | "All simulations were run on an Ubuntu 20.04 LTS server with a 24-core Intel i9-7920X CPU and 94 GB RAM. Claims data analyses were run on a secure CentOS Linux 7 server with a 40-core Intel Xeon E5-4650 CPU and 504 GB RAM."
Software Dependencies | No | The paper mentions "Causal forests were fit according to default parameters specified in the EconML package (Battocchi et al., 2019)" and "LogisticRegressionCV scikit-learn models" but does not specify version numbers for these software components.
Experiment Setup | Yes | "Causal forests were fit according to default parameters specified in the EconML package (Battocchi et al., 2019) (with honesty enabled for valid and unbiased inference), and a fixed depth of 3 and minimum leaf size of 100 were used for subsequent CATE causal trees distilled from the forests to ensure subgroups remained interpretable. The causal forest implementation in EconML by default runs a two-fold cross validation internally when selecting hyperparameters for the LogisticRegressionCV scikit-learn models for treatment, which searches over L2 regularization parameters in a grid of 10 values between 1e-4 and 1e4 using the default accuracy criterion."
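The tree-distillation step in this setup can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the per-unit CATE estimates that a causal forest (e.g. EconML's honest causal forest) would produce are simulated here, and a scikit-learn regression tree with the paper's stated constraints (depth 3, minimum leaf size 100) is fit to them, so that each leaf defines an interpretable candidate subgroup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 5))

# Stand-in for the per-unit CATE estimates a fitted causal forest would emit;
# the true heterogeneity here is driven by the first two covariates.
cate_hat = 0.5 * (X[:, 0] > 0) + 0.1 * X[:, 1] + rng.normal(scale=0.05, size=n)

# Distill the CATE surface into an interpretable subgroup tree using the
# constraints reported in the paper: fixed depth 3, minimum leaf size 100.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=100, random_state=0)
tree.fit(X, cate_hat)

# Each leaf is a candidate subgroup with its own average estimated effect
# (at most 2**3 = 8 leaves, each containing at least 100 units).
leaf_ids = tree.apply(X)
print(len(np.unique(leaf_ids)))
```

The depth and leaf-size caps trade estimation granularity for interpretability: shallow trees with large leaves yield a handful of subgroups that can be described by at most three covariate splits each.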