Doubly robust identification of treatment effects from multiple environments
Authors: Piersilvio De Bartolomeis, Julia Kostin, Javier Abad, Yixin Wang, Fanny Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations across synthetic, semi-synthetic, and real-world datasets show that our approach significantly outperforms existing methods. |
| Researcher Affiliation | Academia | ETH Zurich; University of Michigan |
| Pseudocode | Yes | A.3.1 Algorithm 1: Combinatorial search over subsets; A.3.2 Algorithm 2: Gumbel trick |
| Open Source Code | Yes | See our GitHub repository: https://github.com/jaabmar/RAMEN/ |
| Open Datasets | Yes | The IHDP dataset contains covariates from n = 748 low-birth-weight, premature infants enrolled in a home visitation program designed to improve their cognitive scores (Hill, 2011). In our real-world experiment, we evaluate our method on the observational dataset from Cattaneo (2010) that studies the effect of maternal smoking (treatment T) during pregnancy on birth weight (outcome Y) using the data from n = 4642 patients. |
| Dataset Splits | Yes | In Figure 2 (Row 1), we present the empirical MAE for all methods on finite-sample experiments that confirm the predictions from theory. First, both of our methods, θ̂ and θ̂_insta, consistently achieve lower MAE than the baselines in all scenarios. In particular, we observe that the differentiable relaxation of our method does not significantly compromise statistical performance. Further, for T-invariance, the performance of θ̂_irm deteriorates markedly as expected; e.g., in scenarios where the post-treatment variable is a descendant of Y, it performs worse than simply adjusting for all available covariates. In contrast, our approach remains robust even when one of the invariances is compromised. Finally, we observe that relying on T-invariance increases the error across methods, possibly because the adjustment set we recover, the parents of the treatment, leads to a statistically less efficient estimator; see, e.g., Henckel et al. (2022, Corollary 3.4). In our experiments we sample 100 DAGs and for each DAG vary the number of environments while keeping the sample size fixed. We set the number of environments to \|E\| = 5. We split the original dataset into \|E\| = 4 environments defined by the trimester of birth. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, processor types, or memory amounts) are mentioned in the paper for running the experiments. The text focuses on the methodology, datasets, and software configurations without specifying the underlying hardware infrastructure. |
| Software Dependencies | No | Implementation details: We implement our method, θ̂_insta, by performing a hyperparameter search over the following parameters at each iteration: learning rate in [0.001, 0.01, 0.1], initial temperature in [0.5, 0.8, 1.0], and annealing rate in [0.9, 0.95, 0.99]. The optimal combination of these hyperparameters is selected by minimizing both the T-invariance and Y-invariance losses. The outcome functions for θ̂_all, θ̂, and θ̂_irm are estimated using a linear regression model; logistic regression is used for propensity score estimation. Implementation details: We perform the same hyperparameter search as above. The outcome and treatment assignment functions for both θ̂_all and θ̂_insta are estimated using XGBoost, with the number of estimators set to 1,000, the learning rate to 0.01, and the maximum tree depth to 6. For the non-linear IRM baseline, we employ the TARNet architecture (Shalit et al., 2017), which consists of a shared representation with a single hidden layer of 200 neurons, followed by two hypothesis-specific hidden layers, each with 100 neurons. Logistic regression is used for propensity score estimation. |
| Experiment Setup | Yes | Implementation details: We implement our method, θ̂_insta, by performing a hyperparameter search over the following parameters at each iteration: learning rate in [0.001, 0.01, 0.1], initial temperature in [0.5, 0.8, 1.0], and annealing rate in [0.9, 0.95, 0.99]. The optimal combination of these hyperparameters is selected by minimizing both the T-invariance and Y-invariance losses. The outcome functions for θ̂_all, θ̂, and θ̂_irm are estimated using a linear regression model; logistic regression is used for propensity score estimation. Implementation details: We perform the same hyperparameter search as above. The outcome and treatment assignment functions for both θ̂_all and θ̂_insta are estimated using XGBoost, with the number of estimators set to 1,000, the learning rate to 0.01, and the maximum tree depth to 6. For the non-linear IRM baseline, we employ the TARNet architecture (Shalit et al., 2017), which consists of a shared representation with a single hidden layer of 200 neurons, followed by two hypothesis-specific hidden layers, each with 100 neurons. Logistic regression is used for propensity score estimation. Implementation details: We implement our method, θ̂_insta, using the following hyperparameters: number of epochs 700, patience 100, learning rate 0.1, initial temperature 1.0, and annealing rate 0.9. This configuration was chosen because it provided robust and favorable results across experiments, specifically in minimizing the T- and Y-invariance losses. All other hyperparameters are kept from previous experiments. The outcome and treatment assignment functions for both θ̂_all and θ̂_insta are estimated using XGBoost, with the number of estimators set to 1,000, the learning rate to 0.01, and the maximum depth to 6. For the non-linear IRM implementation, we use the TARNet architecture, as in the IHDP experiments. |
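The hyperparameter search quoted above (learning rate × initial temperature × annealing rate, selected by the invariance losses) can be sketched as a plain grid search. This is a minimal sketch, not the authors' code: `invariance_loss` is a hypothetical callable standing in for training the relaxed objective and returning the two losses, and summing them as the selection criterion is an assumption (the paper says only that both losses are minimized).

```python
from itertools import product

# Grid reported in the paper's implementation details.
LEARNING_RATES = [0.001, 0.01, 0.1]
INIT_TEMPERATURES = [0.5, 0.8, 1.0]
ANNEALING_RATES = [0.9, 0.95, 0.99]

def grid_search(invariance_loss):
    """Return the configuration minimizing the combined invariance loss.

    `invariance_loss(lr, temp, rate)` is a hypothetical stand-in that trains
    the relaxed objective and returns (T-invariance loss, Y-invariance loss).
    """
    best_cfg, best_loss = None, float("inf")
    for lr, temp, rate in product(LEARNING_RATES, INIT_TEMPERATURES, ANNEALING_RATES):
        t_loss, y_loss = invariance_loss(lr, temp, rate)
        combined = t_loss + y_loss  # assumption: the two losses are simply summed
        if combined < best_loss:
            best_cfg, best_loss = (lr, temp, rate), combined
    return best_cfg, best_loss
```

Because the grid has only 3 × 3 × 3 = 27 combinations, exhaustive search is cheap; the expensive part in practice is each call to the training routine.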
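The "Gumbel trick" named in Algorithm 2 and the initial-temperature/annealing-rate hyperparameters above suggest a Gumbel-softmax style relaxation with an exponentially decayed temperature. The sketch below shows one common form of that machinery, assuming a multiplicative per-step decay with a floor and standard Gumbel perturbation of logits; the paper's exact schedule and parameterization may differ.

```python
import math
import random

def annealed_temperature(step, t0=1.0, rate=0.9, t_min=0.1):
    """Exponentially annealed temperature with a floor.

    Assumption: temperature is multiplied by `rate` each step, clipped at
    `t_min`; `t0` and `rate` match values listed in the paper's grid.
    """
    return max(t_min, t0 * rate ** step)

def gumbel_softmax_probs(logits, temperature):
    """Sample a relaxed categorical distribution via the Gumbel trick.

    Perturbs each logit with standard Gumbel noise, divides by the
    temperature, and applies a numerically stable softmax.
    """
    noise = [-math.log(-math.log(random.random())) for _ in logits]
    z = [(l + g) / temperature for l, g in zip(logits, noise)]
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]
```

As the temperature anneals toward its floor, the sampled probability vectors concentrate on a single coordinate, approximating a hard subset selection while staying differentiable during training.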