Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective

Authors: Hechuan Wen, Tong Chen, Mingming Gong, Li Kheng Chai, Shazia Sadiq, Hongzhi Yin

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets. Code: https://github.com/uqhwen2/FCCM. ... In Figure 5, we observe that our proposed method generally serves as the risk lower bound on all three datasets. Its strong performance empirically demonstrates the superiority of our method...
Researcher Affiliation Academia 1 School of EECS, The University of Queensland, Australia; 2 School of Mathematics and Statistics, The University of Melbourne, Australia; 3 Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates; 4 Health and Wellbeing Queensland, Australia. Correspondence to: Hongzhi Yin <EMAIL>.
Pseudocode Yes Algorithm 1 Greedy Radius Reduction (Sketch) ... Algorithm 2 FCCM
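The paper's Algorithm 1 is named "Greedy Radius Reduction" but its body is not reproduced here. As a point of reference only, a generic k-center-style greedy selection (a standard way to greedily shrink a covering radius, not necessarily the authors' exact procedure) can be sketched as:

```python
import numpy as np

def greedy_radius_reduction(covered, pool, budget):
    """Generic k-center-style greedy sketch: at each step, acquire the pool
    point farthest from the current cover, which greedily reduces the covering
    radius (the max distance from any pool point to its nearest covered point).
    This is an illustrative stand-in, not the paper's Algorithm 1."""
    covered = list(covered)
    pool = list(pool)
    picked = []
    # distance of each pool point to its nearest covered point
    dists = np.array([min(np.linalg.norm(p - c) for c in covered) for p in pool])
    for _ in range(budget):
        i = int(np.argmax(dists))  # farthest point from the current cover
        picked.append(pool[i])
        # the acquired point now also covers its neighbours
        dists = np.minimum(
            dists, np.array([np.linalg.norm(p - pool[i]) for p in pool])
        )
    return picked, float(dists.max())  # selections and residual radius
```

The residual radius returned after each acquisition is the quantity such greedy schemes monotonically reduce.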
Open Source Code Yes Code: https://github.com/uqhwen2/FCCM.
Open Datasets Yes IBM (Shimoni et al., 2018), a high-dimensional tabular dataset based on the publicly available Linked Births and Infant Deaths Database. ... CMNIST (Jesson et al., 2021a) This dataset contains 60,000 image samples (10 classes) of size 28×28, which are adapted from the MNIST (LeCun, 1998) benchmark.
Dataset Splits Yes The details of the data acquisition setup are summarized in Table 1, where we initialize the training set S with all labeled samples (denoted as ALL*) from group t = 0 and start acquisition only on the samples from t = 1, which simulates scenarios with a significant number of missing counterfactual samples. Then, a fixed step length is enforced at each acquisition step, with fifty data acquisition steps.

Table 1. Summary of the Acquisition Setup and Testing Dataset

         Start  Length  Steps  Pool   Val    Test
TOY      ALL*   1       50     7200   2880   1600
IBM      ALL*   50      50     2891   3180   6250
CMNIST   ALL*   50      50     16706  10500  18000
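The fixed-step acquisition schedule in Table 1 can be made concrete with a minimal sketch (the helper name and structure are illustrative, not from the released code):

```python
def acquisition_schedule(pool_size, step_length, steps):
    """Sketch of the fixed-step acquisition loop: the training set starts
    with all labeled t=0 samples, then acquires `step_length` samples from
    the t=1 pool at each of `steps` rounds, never exceeding the pool size."""
    acquired = 0
    sizes = []
    for _ in range(steps):
        take = min(step_length, pool_size - acquired)
        acquired += take
        sizes.append(acquired)  # cumulative acquisitions after this round
    return sizes

# IBM row of Table 1: 50 samples per step, 50 steps, pool of 2891
ibm_sizes = acquisition_schedule(2891, 50, 50)
```

Under the IBM setup this acquires 2,500 of the 2,891 pool samples by the final step; the TOY setup (step length 1) acquires only 50.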
Hardware Specification Yes We conduct all the experiments with a 24GB NVIDIA RTX-3090 GPU on an Ubuntu 22.04 LTS platform with a 12th Gen Intel i7-12700K 12-core, 20-thread CPU.
Software Dependencies No No software versions or dependency list are reported; the paper only restates the hardware setup and notes that, for fair comparison, it takes the hyperparameters tuned in (Jesson et al., 2021b; Wen et al., 2025) for the estimators DUE-DNN (Van Amersfoort et al., 2021) and DUE-CNN (Van Amersfoort et al., 2021) shown in Table 3.
Experiment Setup Yes As stated in the main text, for fair comparison, we take the consistent hyperparameters tuned in (Jesson et al., 2021b; Wen et al., 2025) for the estimators DUE-DNN (Van Amersfoort et al., 2021) and DUE-CNN (Van Amersfoort et al., 2021), shown in Table 3. Additionally, we search for the best hyperparameters, i.e., covering radius δ and edge weight α for counterfactual linkage, for Algorithm 2 with the validation set, shown in Table 4.

Table 3. Hyperparameters for Estimators

Hyperparameters  DNN    CNN
Kernel           RBF    Matern
Inducing Points  100    100
Hidden Neurons   200    200
Depth            3      2
Dropout Rate     0.1    0.05
Spectral Norm    0.95   3.0
Learning Rate    1e-3   1e-3

Table 4. Hyperparameters for Algorithm 2

Hyperparameters    Search Space        Tuned
δ(1,1) for TOY     [0.11, 0.12, 0.13]  0.11
δ(1,0) for TOY     [0.11, 0.12, 0.13]  0.11
δ(1,1) for IBM     [0.11, 0.13, 0.15]  0.11
δ(1,0) for IBM     [0.11, 0.13, 0.15]  0.11
δ(1,1) for CMNIST  [0.40, 0.45, 0.50]  0.50
δ(1,0) for CMNIST  [0.40, 0.45, 0.50]  0.40
Edge weight α      [1.0, 2.5, 5.0]     2.5
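The Table 4 search is a small grid over δ and α scored on the validation set. A hedged sketch of such a search (the `run_fccm` callable is a hypothetical stand-in for one training-plus-validation run, not a function from the released code):

```python
from itertools import product

def tune_hyperparameters(run_fccm, deltas, alphas):
    """Grid-search sketch over covering radius delta and edge weight alpha,
    keeping the setting with the lowest validation risk. `run_fccm` is a
    hypothetical callable mapping one (delta, alpha) setting to a risk."""
    best = None
    for delta, alpha in product(deltas, alphas):
        risk = run_fccm(delta=delta, alpha=alpha)
        if best is None or risk < best[0]:
            best = (risk, delta, alpha)  # keep the current best setting
    return {"val_risk": best[0], "delta": best[1], "alpha": best[2]}
```

With the IBM search spaces from Table 4 this evaluates 3 × 3 = 9 settings per treatment pairing.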