X-Hacking: The Threat of Misguided AutoML
Authors: Rahul Sharma, Sumantrak Mukherjee, Andrea Sipka, Eyke Hüllermeier, Sebastian Josef Vollmer, Sergey Redyuk, David Antony Selby
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate empirically on familiar real-world datasets that, on average, Bayesian optimisation accelerates X-hacking 3-fold for features susceptible to it, versus random sampling. We demonstrate empirically in a post-hoc manner how off-the-shelf AutoML pipelines can be used, even with a limited computational budget, to perform X-hacking on SHAP values for familiar real-world datasets, by cherry-picking those models that support a desired narrative. We selected 23 datasets from the OpenML-CC18 classification benchmark (Bischl et al., 2021). |
| Researcher Affiliation | Collaboration | 1Deutsches Forschungszentrum für Künstliche Intelligenz GmbH 2Institute of Informatics, Ludwig-Maximilians-Universität München. Correspondence to: David Antony Selby <david.antony EMAIL>. |
| Pseudocode | No | The paper describes methodologies and experimental setups but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code to replicate our experiments is available on GitHub. However, our code and experiments are open source (GitHub: https://github.com/datasciapps/x-hacking) and we encourage others to adapt the experimental setting to their own view on what plausible effort might be on the part of an adversary. |
| Open Datasets | Yes | We selected 23 datasets from the OpenML-CC18 classification benchmark (Bischl et al., 2021), where the task was binary classification from tabular features. |
| Dataset Splits | Yes | For training all the models, baseline and AutoML, we held out 20% of the samples as a test set for each of the datasets listed in Table 5. |
| Hardware Specification | Yes | Models were evaluated on an internal cluster using 192 CPUs with 300GB RAM. For ad-hoc X-hacking we have used at most 292 CPUs in parallel on an institutional computing cluster. To run many candidate models for each dataset, a maximum of 1 TB of RAM was used for the experiments. |
| Software Dependencies | No | The Python packages scikit-learn (Pedregosa et al., 2011) and auto-sklearn (Feurer et al., 2019) were used to build and train the ML models, and the shap package (Lundberg et al., 2018) estimated SHAP values from the successfully evaluated models. We used the pandas and numpy libraries for data wrangling, scikit-learn as our base ML library, auto-sklearn for automated model search, optuna for multi-objective optimisation, and shap for calculating SHAP values. |
| Experiment Setup | Yes | The baseline model was a random forest classifier trained with scikit-learn with all parameters set to default. For the baseline and all models evaluated in AutoML, shap's model-agnostic KernelExplainer routine was employed to compute SHAP values, using a background sample size of 50 and a test sample size of 100. For each dataset, we ran auto-sklearn for 3600 seconds in total with a runtime limit of 100 seconds for each candidate model and the stated random seed. An ensemble size of 1 was used. MOTPESampler (Ozaki et al., 2022) was used as the Bayesian optimiser. Predictive performance was measured by accuracy score. The AutoML pipeline was optimised for both objectives simultaneously as a multi-objective optimisation: maximising the accuracy score while minimising the mean absolute SHAP value. |
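The baseline protocol described in the table (an 80/20 split and a default-parameter scikit-learn random forest, evaluated by accuracy) can be sketched as follows. The synthetic dataset is a stand-in for the paper's OpenML-CC18 binary classification tasks, and the SHAP step (KernelExplainer with 50 background and 100 test samples) is noted in a comment rather than run, since it requires the shap package:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; the paper uses 23 OpenML-CC18 binary classification datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 20% of samples held out as the test set, as in the paper's protocol.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: random forest with all parameters left at their defaults.
baseline = RandomForestClassifier(random_state=0)
baseline.fit(X_train, y_train)
acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"baseline accuracy: {acc:.3f}")

# With shap installed, the paper's explanation step would look roughly like:
#   explainer = shap.KernelExplainer(baseline.predict_proba, background_50)
#   shap_values = explainer.shap_values(X_test_100)
```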
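The multi-objective cherry-picking idea can be illustrated with a deliberately simplified stand-in: instead of auto-sklearn driven by the MOTPESampler, the sketch below runs a small random search over random-forest hyperparameters and scores each candidate on (accuracy, apparent importance of a target feature). A crude feature-masking importance (mean absolute change in predicted probability when the feature is replaced by its column mean) stands in for the mean absolute SHAP value; the hyperparameter ranges, the masking proxy, and the 0.05 accuracy tolerance are all illustrative assumptions, not the paper's method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

TARGET = 0  # index of the feature whose apparent importance is to be minimised

def masked_importance(model, X_eval, j):
    """Crude stand-in for mean |SHAP|: mean absolute change in the predicted
    probability when feature j is replaced by its column mean."""
    p = model.predict_proba(X_eval)[:, 1]
    X_masked = X_eval.copy()
    X_masked[:, j] = X_eval[:, j].mean()
    return float(np.abs(p - model.predict_proba(X_masked)[:, 1]).mean())

# Random search in place of the paper's Bayesian (MOTPE) optimisation.
candidates = []
for _ in range(10):
    params = {"n_estimators": int(rng.integers(20, 100)),
              "max_depth": int(rng.integers(2, 10)),
              "random_state": 0}
    m = RandomForestClassifier(**params).fit(X_train, y_train)
    acc = accuracy_score(y_test, m.predict(X_test))
    imp = masked_importance(m, X_test, TARGET)
    candidates.append((acc, imp, params))

# Cherry-pick: among reasonably accurate models, report the one in which the
# target feature appears least important -- the essence of X-hacking.
best_acc = max(a for a, _, _ in candidates)
chosen = min((c for c in candidates if c[0] >= best_acc - 0.05),
             key=lambda c: c[1])
print(f"chosen: acc={chosen[0]:.3f}, importance={chosen[1]:.4f}")
```

The two-objective structure (maximise accuracy, minimise the target feature's attribution) mirrors the paper's setup; only the search strategy and the attribution measure are simplified here.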