Uplift Model Evaluation with Ordinal Dominance Graphs

Authors: Brecht Verbeken, Marie-Anne Guerry, Wouter Verbeke, Sam Verboven

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically validate the improved discriminative power of ROCini and pROCini in a simulation study as well as via experiments on real data.
Researcher Affiliation | Academia | Brecht Verbeken, Department of Business Technology and Operations, Data Analytics Laboratory, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium; Marie-Anne Guerry, Department of Business Technology and Operations, Data Analytics Laboratory, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium; Wouter Verbeke, Faculty of Economics and Business, KU Leuven, Naamsestraat 69, Leuven 3000, Belgium; Sam Verboven, Department of Business Technology and Operations, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium
Pseudocode | Yes | Algorithm 1 Simulation of uplift model scores
Open Source Code | No | No explicit statement about the authors' own code for the methodology being open-sourced or a repository link is provided. The paper mentions using the 'sklift package' but this refers to a third-party tool.
Open Datasets | Yes | In this subsection, we present the results on three commonly used uplift modelling benchmark data sets: the Hillstrom (Hillstrom, 2008), Criteo (Diemert, Eustache et al., 2018), and Information (Writer and Others, 2021) data sets.
Dataset Splits | No | The paper mentions that for the semi-synthetic evaluation, 'We applied these models to a population of 1,000 observations drawn from the Hillstrom data set'. While it discusses the simulation protocol (Algorithm 1) and the setup of treatment and control groups, it does not provide explicit train/test/validation splits (e.g., percentages or counts) for the empirical models trained on the real datasets (Hillstrom, Criteo, Information) or for the semi-synthetic experiment.
Hardware Specification | No | The paper does not mention any specific hardware (e.g., CPU, GPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | The paper mentions using the 'sklift package' but does not specify a version number. No other software dependencies are mentioned with version numbers.
Experiment Setup | Yes | Specifically, we augment the original Hillstrom data set by generating synthetic outcomes via a logistic function, given by $p_i = \frac{1}{1 + \exp\left(-(\beta_0 + X_i^\top \beta + \beta_t T_i + \epsilon_i)\right)}$, where $X_i$ represents the original (standardized) features for observation $i$, $T_i$ represents the treatment indicator, $\beta_t$ represents the average treatment effect parameter, and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. ... with parameters $\beta_0 = 0.0$, $\beta = 1$, $\beta_t = 0.5$, and $\sigma = 0.1$ (Marchese et al., 2025; Hill, 2011; Alaa and Van Der Schaar, 2017). This setup yields nonlinear treatment response behaviour and allows full control over the treatment effect strength and noise. We trained four uplift models based on standard S-learner and T-learner strategies (Künzel et al., 2019; Curth and Van der Schaar, 2021), each implemented with two widely used base learners: Logistic Regression and XGBoost.
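
The synthetic-outcome step quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `simulate_outcomes` is hypothetical, the logistic function is taken in its standard form 1 / (1 + exp(-z)), "β = 1" is read as a coefficient of one on every feature, the binary outcome is assumed to be a Bernoulli draw from the resulting probability, and the toy features below stand in for the standardized Hillstrom features.

```python
import math
import random


def simulate_outcomes(X, T, beta0=0.0, beta=1.0, beta_t=0.5, sigma=0.1, seed=0):
    """Return (probabilities, binary outcomes) for each observation.

    X      : list of feature rows, assumed already standardized
    T      : list of 0/1 treatment indicators
    beta   : single coefficient applied to every feature
             (assumption: "beta = 1" means a vector of ones)
    beta_t : treatment effect parameter on the logit scale
    sigma  : standard deviation of the Gaussian noise term
    """
    rng = random.Random(seed)
    probs, outcomes = [], []
    for x_i, t_i in zip(X, T):
        eps = rng.gauss(0.0, sigma)                    # epsilon_i ~ N(0, sigma^2)
        z = beta0 + beta * sum(x_i) + beta_t * t_i + eps
        p = 1.0 / (1.0 + math.exp(-z))                 # standard logistic function
        probs.append(p)
        # Assumption: the outcome is a Bernoulli draw with success probability p.
        outcomes.append(1 if rng.random() < p else 0)
    return probs, outcomes


# Toy usage: 1,000 observations with 3 standard-normal features and a
# randomized treatment; placeholder data, not the Hillstrom data set.
toy_rng = random.Random(42)
X_toy = [[toy_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(1000)]
T_toy = [toy_rng.randint(0, 1) for _ in range(1000)]
probs_toy, y_toy = simulate_outcomes(X_toy, T_toy)
```

Because the treatment enters through the logit, the same `beta_t` produces a different change in probability depending on an observation's baseline, which is one way to read the "nonlinear treatment response behaviour" mentioned in the setup.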