reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Uplift Model Evaluation with Ordinal Dominance Graphs

Authors: Brecht Verbeken, Marie-Anne Guerry, Wouter Verbeke, Sam Verboven

JMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we empirically validate the improved discriminative power of ROCini and pROCini in a simulation study as well as via experiments on real data.
Researcher Affiliation	Academia	Brecht Verbeken: Department of Business Technology and Operations, Data Analytics Laboratory, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium; Marie-Anne Guerry: Department of Business Technology and Operations, Data Analytics Laboratory, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium; Wouter Verbeke: Faculty of Economics and Business, KU Leuven, Naamsestraat 69, Leuven 3000, Belgium; Sam Verboven: Department of Business Technology and Operations, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium
Pseudocode	Yes	Algorithm 1 Simulation of uplift model scores
Open Source Code	No	No explicit statement about the authors' own code for the methodology being open-sourced or a repository link is provided. The paper mentions using the 'sklift package' but this refers to a third-party tool.
Open Datasets	Yes	In this subsection, we present the results on three commonly used uplift modelling benchmark data sets: the Hillstrom (Hillstrom, 2008), Criteo (Diemert, Eustache et al., 2018), and Information (Writer and Others, 2021) data sets.
Dataset Splits	No	The paper mentions that for the semi-synthetic evaluation, 'We applied these models to a population of 1,000 observations drawn from the Hillstrom data set'. While it discusses the simulation protocol (Algorithm 1) and the setup of treatment and control groups, it does not provide explicit train/test/validation splits (e.g., percentages or counts) for the empirical models trained on the real datasets (Hillstrom, Criteo, Information) or for the semi-synthetic experiment.
Hardware Specification	No	The paper does not mention any specific hardware (e.g., CPU, GPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies	No	The paper mentions using the 'sklift package' but does not specify a version number. No other software dependencies are mentioned with version numbers.
Experiment Setup	Yes	Specifically, we augment the original Hillstrom data set by generating synthetic outcomes via a logistic function, given by $p_i = \frac{1}{1 + \exp\left(-\left(\beta_0 + X_i^\top \beta + \beta_t T_i + \epsilon_i\right)\right)}$, where $X_i$ represents the original (standardized) features for observation $i$, $T_i$ represents the treatment indicator, $\beta_t$ represents the average treatment effect parameter, and $\epsilon_i \sim N(0, \sigma^2)$. ... with parameters $\beta_0 = 0.0$, $\beta = 1$, $\beta_t = 0.5$, and $\sigma = 0.1$ (Marchese et al., 2025; Hill, 2011; Alaa and Van Der Schaar, 2017). This setup yields nonlinear treatment response behaviour and allows full control over the treatment effect strength and noise. We trained four uplift models based on standard S-learner and T-learner strategies (Künzel et al., 2019; Curth and Van der Schaar, 2021), each implemented with two widely used base learners: Logistic Regression and XGBoost.
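
The synthetic-outcome formula quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on several assumptions that do not come from the paper: the function name `simulate_outcomes` is hypothetical; the features are random standard-normal stand-ins rather than the standardized Hillstrom features; β = 1 is read as a coefficient of 1 on every feature; the logistic function is taken in its standard form with a negative exponent; and binary outcomes are drawn as Bernoulli samples from p_i.

```python
import math
import random


def simulate_outcomes(X, T, beta0=0.0, beta=1.0, beta_t=0.5, sigma=0.1, seed=0):
    """Return (probabilities, binary outcomes), one entry per observation.

    X is a list of feature rows (assumed already standardized) and
    T is a list of 0/1 treatment indicators of the same length.
    """
    rng = random.Random(seed)
    probs, outcomes = [], []
    for x_i, t_i in zip(X, T):
        # Gaussian noise term eps_i ~ N(0, sigma^2)
        eps = rng.gauss(0.0, sigma)
        # Linear predictor; beta is applied to every feature (assumption)
        linear = beta0 + beta * sum(x_i) + beta_t * t_i + eps
        # Standard logistic function (sign convention is an assumption)
        p_i = 1.0 / (1.0 + math.exp(-linear))
        probs.append(p_i)
        # Bernoulli draw of the binary outcome (assumption)
        outcomes.append(1 if rng.random() < p_i else 0)
    return probs, outcomes


# Illustrative usage with stand-in data: 1,000 observations, 3 features
rng = random.Random(42)
n, d = 1000, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
T = [rng.randint(0, 1) for _ in range(n)]
probs, outcomes = simulate_outcomes(X, T)
print(sum(outcomes) / n)  # observed response rate in the simulated sample
```

Because β_t enters the linear predictor additively, the effect of treatment on p_i depends on where the other terms place an observation on the logistic curve, which is one way to read the "nonlinear treatment response behaviour" mentioned in the quoted setup.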