X-Hacking: The Threat of Misguided AutoML
Authors: Rahul Sharma, Sumantrak Mukherjee, Andrea Sipka, Eyke Hüllermeier, Sebastian Josef Vollmer, Sergey Redyuk, David Antony Selby
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate empirically on familiar real-world datasets that, on average, Bayesian optimisation accelerates X-hacking 3-fold for features susceptible to it, versus random sampling. We demonstrate empirically in a post-hoc manner how off-the-shelf AutoML pipelines can be used, even with a limited computational budget, to perform X-hacking on SHAP values for familiar real-world datasets, by cherry-picking those models that support a desired narrative. We selected 23 datasets from the OpenML-CC18 classification benchmark (Bischl et al., 2021). |
| Researcher Affiliation | Collaboration | 1Deutsches Forschungszentrum für Künstliche Intelligenz GmbH 2Institute of Informatics, Ludwig-Maximilians-Universität München. Correspondence to: David Antony Selby <david.antony EMAIL>. |
| Pseudocode | No | The paper describes methodologies and experimental setups but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code to replicate our experiments is available on GitHub. However, our code and experiments are open source (GitHub: https://github.com/datasciapps/x-hacking) and we encourage others to adapt the experimental setting to their own view on what plausible effort might be on the part of an adversary. |
| Open Datasets | Yes | We selected 23 datasets from the OpenML-CC18 classification benchmark (Bischl et al., 2021), where the task was binary classification from tabular features. |
| Dataset Splits | Yes | For training all the models, baseline and AutoML, we held out 20% of the samples as a test set for each of the datasets listed in Table 5. |
| Hardware Specification | Yes | Models were evaluated on an internal cluster using 192 CPUs with 300GB RAM. For ad-hoc X-hacking we have used at most 292 CPUs in parallel on an institutional computing cluster. To run many candidate models for each dataset, a maximum of 1 TB of RAM was used for the experiments. |
| Software Dependencies | No | The Python packages scikit-learn (Pedregosa et al., 2011) and auto-sklearn (Feurer et al., 2019) were used to build and train the ML models, and the shap package (Lundberg et al., 2018) estimated SHAP values from the successfully evaluated models. We used the pandas and numpy libraries for data wrangling, scikit-learn as our base ML library, auto-sklearn for automated model search, optuna for multi-objective optimisation, and shap for calculating SHAP values. |
| Experiment Setup | Yes | The baseline model was a random forest classifier trained with scikit-learn with all parameters set to default. For the baseline and all models evaluated in AutoML, shap's model-agnostic KernelExplainer routine was employed to compute SHAP values, using a background sample size of 50 and a test sample size of 100. For each dataset, we ran auto-sklearn for 3600 seconds in total with a runtime limit of 100 seconds for each candidate model and the stated random seed. An ensemble size of 1 was used. MOTPESampler (Ozaki et al., 2022) was used as the Bayesian optimiser. Predictive performance was measured by accuracy score. The AutoML pipeline was optimised for both objectives simultaneously as a multi-objective optimisation: maximising the accuracy score while minimising the mean absolute SHAP value. |
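The baseline protocol described in the table (an 80/20 split and a default-parameter scikit-learn random forest, evaluated by accuracy) can be sketched as follows. The synthetic dataset is a stand-in for the paper's OpenML-CC18 binary classification tasks, and the SHAP step (KernelExplainer with 50 background and 100 test samples) is noted in a comment rather than run, since it requires the shap package:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; the paper uses 23 OpenML-CC18 binary classification datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 20% of samples held out as the test set, as in the paper's protocol.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: random forest with all parameters left at their defaults.
baseline = RandomForestClassifier(random_state=0)
baseline.fit(X_train, y_train)
acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"baseline accuracy: {acc:.3f}")

# With shap installed, the paper's explanation step would look roughly like:
#   explainer = shap.KernelExplainer(baseline.predict_proba, background_50)
#   shap_values = explainer.shap_values(X_test_100)
```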
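The multi-objective cherry-picking idea can be illustrated with a deliberately simplified stand-in: instead of auto-sklearn driven by the MOTPESampler, the sketch below runs a small random search over random-forest hyperparameters and scores each candidate on (accuracy, apparent importance of a target feature). A crude feature-masking importance (mean absolute change in predicted probability when the feature is replaced by its column mean) stands in for the mean absolute SHAP value; the hyperparameter ranges, the masking proxy, and the 0.05 accuracy tolerance are all illustrative assumptions, not the paper's method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

TARGET = 0  # index of the feature whose apparent importance is to be minimised

def masked_importance(model, X_eval, j):
    """Crude stand-in for mean |SHAP|: mean absolute change in the predicted
    probability when feature j is replaced by its column mean."""
    p = model.predict_proba(X_eval)[:, 1]
    X_masked = X_eval.copy()
    X_masked[:, j] = X_eval[:, j].mean()
    return float(np.abs(p - model.predict_proba(X_masked)[:, 1]).mean())

# Random search in place of the paper's Bayesian (MOTPE) optimisation.
candidates = []
for _ in range(10):
    params = {"n_estimators": int(rng.integers(20, 100)),
              "max_depth": int(rng.integers(2, 10)),
              "random_state": 0}
    m = RandomForestClassifier(**params).fit(X_train, y_train)
    acc = accuracy_score(y_test, m.predict(X_test))
    imp = masked_importance(m, X_test, TARGET)
    candidates.append((acc, imp, params))

# Cherry-pick: among reasonably accurate models, report the one in which the
# target feature appears least important -- the essence of X-hacking.
best_acc = max(a for a, _, _ in candidates)
chosen = min((c for c in candidates if c[0] >= best_acc - 0.05),
             key=lambda c: c[1])
print(f"chosen: acc={chosen[0]:.3f}, importance={chosen[1]:.4f}")
```

The two-objective structure (maximise accuracy, minimise the target feature's attribution) mirrors the paper's setup; only the search strategy and the attribution measure are simplified here.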