AutoCATE: End-to-End, Automated Treatment Effect Estimation
Authors: Toon Vanderschueren, Tim Verdonck, Mihaela van der Schaar, Wouter Verbeke
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section analyzes AutoCATE's design choices per stage: evaluation (5.2), estimation (5.3), and ensembling (5.4). We identify best practices and benchmark the resulting configuration against common alternatives (5.5). Our experiments compare various automated, end-to-end strategies for learning a CATE estimation pipeline. Using AutoCATE, we can evaluate a range of design choices. To obtain general insights, we leverage a collection of standard benchmarks for CATE estimation: IHDP (Hill, 2011), ACIC (Dorie et al., 2019), News (Johansson et al., 2016), and Twins (Louizos et al., 2017); see Appendix C for details. These semi-synthetic benchmarks include 247 distinct data sets that vary in outcome (regression and classification), dimensionality, size, and application area, allowing for a comprehensive analysis of AutoCATE. |
| Researcher Affiliation | Academia | 1 KU Leuven, 2 University of Antwerp, 3 University of Cambridge. Correspondence to: Toon Vanderschueren <EMAIL>. |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. It describes the methods in narrative text and illustrates the stages of AutoCATE with figures such as Figure 1. |
| Open Source Code | Yes | To facilitate broad adoption and further research, we release AutoCATE as an open-source software package. The software package and accompanying experimental code are publicly available online at https://github.com/toonvds/AutoCATE. |
| Open Datasets | Yes | Our experiments compare various automated, end-to-end strategies for learning a CATE estimation pipeline. Using AutoCATE, we can evaluate a range of design choices. To obtain general insights, we leverage a collection of standard benchmarks for CATE estimation: IHDP (Hill, 2011), ACIC (Dorie et al., 2019), News (Johansson et al., 2016), and Twins (Louizos et al., 2017); see Appendix C for details. These semi-synthetic benchmarks include 247 distinct data sets that vary in outcome (regression and classification), dimensionality, size, and application area, allowing for a comprehensive analysis of AutoCATE. |
| Dataset Splits | Yes | Figure 3 presents results for different holdout ratios, illustrating this trade-off and showing that a holdout ratio of 30-50% generally works well. We use 30% in the rest of this work. Although more folds in cross-validation often improve model performance in supervised settings, we do not observe this effect for AutoCATE (see Table 5), likely due to the interaction between the number of folds and the holdout ratio. Finally, we include a stratified training-validation split and a stratified k-fold cross-validation procedure. Following the experiments in the main body, we use a 70/30% train-test split. |
| Hardware Specification | Yes | These experiments were conducted locally, on a machine with an AMD Ryzen 7 PRO 4750U processor (1.70 GHz), 32 GB of RAM, and a 64-bit operating system. |
| Software Dependencies | No | AutoCATE is implemented in Python, following scikit-learn's design principles (Pedregosa et al., 2011). Nevertheless, as the search is implemented with optuna (Akiba et al., 2019), we could use a range of optimizers. Where available, we use the CausalML implementations (Chen et al., 2020). |
| Experiment Setup | Yes | Table 3: Preprocessor search spaces. We describe the search spaces for the different preprocessors. If a hyperparameter is not mentioned, we use its default. All preprocessors are implemented with scikit-learn (Pedregosa et al., 2011); we refer to their documentation for more information. Table 4: Baselearner search spaces. We describe the search spaces for each baselearner. If a hyperparameter is not mentioned, we use its default. All baselearners are implemented with scikit-learn (Pedregosa et al., 2011); we refer to their documentation for more information. While efficient optimization strategies such as Bayesian approaches could be used, we use random search throughout this work to focus on other design choices in AutoCATE. |
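The Dataset Splits row quotes a stratified 70/30 train-validation split. A minimal sketch of what such a split looks like with scikit-learn, stratifying on the treatment indicator so both partitions keep the same treated/control ratio (all data here is synthetic and illustrative, not taken from the paper's benchmarks):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: covariates X, binary treatment t, outcome y (all synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
t = rng.integers(0, 2, size=1000)
y = X[:, 0] + 0.5 * t + rng.normal(size=1000)

# 70/30 train-validation split, stratified on treatment assignment so the
# treated/control proportions are preserved in both partitions.
X_tr, X_va, t_tr, t_va, y_tr, y_va = train_test_split(
    X, t, y, test_size=0.3, stratify=t, random_state=42
)
```

Stratifying on treatment (rather than splitting purely at random) matters for CATE estimation because a holdout with too few treated or control units makes validation-based model selection unreliable.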
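The Experiment Setup row states that hyperparameters are drawn by random search over per-component search spaces. The paper's implementation uses optuna for this; the sketch below illustrates the same idea with scikit-learn's `ParameterSampler` so it stays self-contained. The data, the hyperparameter grid, and the choice of base learner are all hypothetical placeholders, not the paper's actual search spaces:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterSampler, train_test_split

# Toy regression task standing in for fitting one baselearner (synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Random search: sample configurations uniformly from a small search space
# and keep the one with the best score on the 30% holdout.
space = {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8, None]}
best_score, best_params = -np.inf, None
for params in ParameterSampler(space, n_iter=5, random_state=0):
    model = RandomForestRegressor(random_state=0, **params).fit(X_tr, y_tr)
    score = model.score(X_va, y_va)  # R^2 on the held-out validation set
    if score > best_score:
        best_score, best_params = score, params
```

As the quoted text notes, a more sample-efficient optimizer (e.g. a Bayesian sampler via optuna) could replace the uniform sampler without changing the surrounding loop.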