Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects

Authors: Jake Fawkes, Robert Hu, Robin J. Evans, Dino Sejdinovic

TMLR 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Finally, we experimentally and theoretically demonstrate the validity of these tests." ... "3. We experimentally validate the performance of our test on synthetic, semisynthetic and real data." ... "5 Experiments"
Researcher Affiliation | Collaboration | Jake Fawkes (Department of Statistics, University of Oxford); Robert Hu (Amazon); Robin J. Evans (Department of Statistics, University of Oxford); Dino Sejdinovic (School of Mathematical Sciences, University of Adelaide)
Pseudocode | No | The paper contains mathematical derivations and descriptions of algorithms, but no explicitly labeled "Pseudocode" or "Algorithm" block with structured code-like steps.
Open Source Code | Yes | "An implementation of our approach can be found at: https://github.com/Jakefawkes/DR_distributional_test."
Open Datasets | Yes | "We evaluate on two standard semi-synthetic tasks: the infant health and development program (IHDP) introduced in Hill (2011), and the linked births and deaths data (LBIDD) (Shimoni et al., 2018)."
Dataset Splits | No | The paper mentions data "randomly split into train/test sets, DTr, DTe" but does not specify exact percentages or sample counts for these splits. It also mentions "We run these experiments with 2000 data points" for simulated data, but this is a total, not a split.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | "For both settings we fit a linear logistic regression for the propensity score so that the model is incorrectly specified." ... "We run these experiments with 2000 data points, rejecting at the 0.05 significance level." ... "The matching for all statistics is done via logistic regression and we apply the permutation from Section 4." ... "We again use logistic regression matching and weights model."
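The experiment setup quoted in the last row can be sketched as a toy reproduction: fit a deliberately misspecified linear logistic propensity model on 2000 points, then calibrate a test statistic with a permutation null at the 0.05 level. The statistic below is a simple inverse-propensity-weighted mean difference standing in for the paper's doubly robust kernel statistic, and the permutation scheme is a plain label shuffle rather than the paper's Section 4 procedure; all names and modeling details here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000        # "We run these experiments with 2000 data points"
alpha = 0.05    # "rejecting at the 0.05 significance level"

# Simulated data: treatment follows a nonlinear propensity, so a
# *linear* logistic model for it is misspecified, as in the paper.
X = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-(X[:, 0] ** 2 - 1)))
T = rng.binomial(1, p_true)
Y = X[:, 0] + rng.normal(size=n)  # no treatment effect: the null holds

def fit_linear_logistic(X, T, lr=0.1, steps=500):
    """Plain gradient-ascent logistic regression with linear features
    (hence incorrectly specified for the nonlinear propensity above)."""
    Z = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Z @ w))
        w += lr * Z.T @ (T - p) / len(X)
    return 1 / (1 + np.exp(-Z @ w))

e_hat = fit_linear_logistic(X, T)

def ipw_stat(T, Y, e):
    """Inverse-propensity-weighted mean difference: a simple stand-in
    for the paper's doubly robust kernel statistic."""
    return np.mean(T * Y / e) - np.mean((1 - T) * Y / (1 - e))

observed = ipw_stat(T, Y, e_hat)

# Permutation null: shuffle treatment labels and recompute the statistic.
perm = np.array([ipw_stat(rng.permutation(T), Y, e_hat) for _ in range(200)])
p_value = float(np.mean(np.abs(perm) >= np.abs(observed)))
reject = p_value < alpha
```

With no true treatment effect, the test should reject only at roughly the nominal 0.05 rate over repeated simulations.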