reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Causal Classification: Treatment Effect Estimation vs. Outcome Prediction

Authors: Carlos Fernández-Loría, Foster Provost

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The theoretical results, as well as simulations, illustrate settings where outcome prediction should actually be better, including cases where (1) the bias may be partially corrected by choosing a different threshold, (2) outcomes and treatment effects are correlated, and (3) data to estimate counterfactuals are limited. A major practical implication is that, for some applications, it might be feasible to make good intervention decisions without any data on how individuals actually behave when intervened. Finally, we show that for a real online advertising application, outcome prediction models indeed excel at causal classification.
Researcher Affiliation	Academia	Carlos Fernández-Loría EMAIL HKUST Business School, Hong Kong University of Science and Technology, Hong Kong; Foster Provost EMAIL Stern School of Business, New York University, New York, NY, USA
Pseudocode	No	The paper includes Python code in Appendix C but it is actual code, not pseudocode or a clearly labeled algorithm block.
Open Source Code	Yes	Appendix C. Simulator Code We present here the Python code we used to generate the data in Section 6.
Open Datasets	Yes	We use data made available by Criteo (an advertising platform) based on randomly targeting advertising to a large sample of users (Diemert Eustache, Betlei Artem et al., 2018). ... See https://ailab.criteo.com/criteo-uplift-prediction-dataset/ for details and access to the data. We use the version of the data set without leakage.
Dataset Splits	Yes	The models were trained and tuned with cross-validation using 80% of the sample (the training set). The targeting approaches were evaluated using the remaining 20% of the sample (the test set).
Hardware Specification	No	The paper does not specify any particular hardware used for running the simulations or the real-world example in Appendix D.
Software Dependencies	No	The Python code in Appendix C imports the 'numpy' library ('import numpy as np'), but no specific version numbers for Python or NumPy are provided.
Experiment Setup	No	For the simulations, 'Table 4 shows the default values used for the simulation parameters.' These are parameters for the simulator, not explicit hyperparameters for a machine learning model. For the practical example, 'All the approaches were implemented using decision tree models.' but no specific hyperparameters for these decision tree models are provided.
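
The split described under Dataset Splits (80% of the sample for training and cross-validated tuning, 20% held out for evaluation) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the random seed, the sample size of 1000, and the choice of 5 folds are all assumptions, since the paper does not report them.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and split them into train and test lists.

    The seed is an arbitrary assumption; the paper does not report one.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(train_indices, n_folds=5):
    """Yield (fit, validation) index lists for cross-validation.

    Folds are drawn only from the training set, so the test set stays
    untouched during tuning. The fold count is an assumption.
    """
    folds = [train_indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Hypothetical sample of 1000 rows: 800 for training/tuning, 200 for testing.
train_idx, test_idx = train_test_split_indices(1000)
folds = list(k_fold_indices(train_idx))
```

Keeping the fold construction inside the training indices mirrors the description above: models are tuned on the 80% portion only, and the remaining 20% is used once, for evaluating the targeting approaches.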