Estimating the Replication Probability of Significant Classification Benchmark Experiments

Authors: Daniel Berrar

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using simulation studies, we show that p-values just below the common significance threshold of 0.05 are insufficient to warrant a high confidence in the replicability of significant results, as such p-values are barely more informative than the flip of a coin. If a replication probability of around 0.95 is desired, then the significance threshold should be lowered to at least 0.003. This observation might explain, at least in part, why many published research findings fail to replicate.
Researcher Affiliation | Academia | Daniel Berrar (EMAIL), Machine Learning Research Group, School of Mathematics and Statistics, The Open University, Milton Keynes MK7 6AA, United Kingdom; and Department of Information and Communications Engineering, School of Engineering, Tokyo Institute of Technology, 2-12-1-S3-70 Ookayama, Meguro-ku, Tokyo 152-8550, Japan
Pseudocode | Yes | Algorithm 1: Bootstrapped estimate of the standard deviation of Z_w. Algorithm 2: Comparison of NB and SVM over multiple datasets. Algorithm 3: Comparison of SVM and SVM_o in cross-validation. Algorithm 4: Comparison of SVM and RF in cross-validation.
Open Source Code | Yes | Code and Data Availability: The R code is available at the project website, https://osf.io/7vqfn/.
Open Datasets | Yes | Two classifiers, A and B, are compared on 44 different benchmark datasets from the UCI repository. The difference in performance (for example, accuracy) is assessed based on a suitable significance test.
Dataset Splits | Yes | Two classifiers, A and B, are compared in k-fold cross-validation, and A significantly outperforms B. The replication probability can then be estimated as follows. Example 4. Let us assume that in 10-fold cross-validation, we observe a variance-corrected t-statistic of 2.262, which corresponds to a two-sided p-value of 0.05.
Hardware Specification | Yes | All models and experiments were implemented on a standard PC (Intel Core i7-7700T CPU, 2.90 GHz × 8, 32 GB RAM).
Software Dependencies | Yes | As learning algorithms, we used the naive Bayes (NB) algorithm and the support vector machine (SVM), both implemented with the R library e1071 (Meyer et al., 2022), and random forests (RF), implemented with the R library randomForest (Liaw and Wiener, 2002). ... D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2022. URL https://CRAN.R-project.org/package=e1071. R package version 1.7-11.
Experiment Setup | Yes | The default hyperparameters were used, and no further optimization was performed.
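The coin-flip claim in the Research Type row can be illustrated with a back-of-the-envelope calculation. The sketch below uses the simple normal-approximation estimate of replication probability, assuming the observed effect equals the true effect; the paper's own procedure is built on variance-corrected cross-validation statistics and is more involved, so this is an illustration of the idea rather than the author's method. (The paper's code is in R; Python is used here for convenience.)

```python
from statistics import NormalDist

def replication_probability(p_two_sided, alpha=0.05):
    """Probability that an exact replication is again significant at
    `alpha` (two-sided), assuming a normal test statistic and that the
    observed effect equals the true effect."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1.0 - p_two_sided / 2.0)   # z implied by the observed p-value
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)        # critical z at the threshold
    # Significant again in the same direction, or (rarely) in the opposite one
    return nd.cdf(z_obs - z_crit) + nd.cdf(-z_obs - z_crit)

print(round(replication_probability(0.05), 3))    # -> 0.5
```

With p = 0.05 the estimate is almost exactly 0.5, the coin flip from the abstract; lowering the observed p-value raises it (e.g. `replication_probability(0.003)` is about 0.84 under this simple model), though the paper's 0.95 target is derived from its own cross-validation statistics.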
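Example 4 in the Dataset Splits row cites a variance-corrected t-statistic of 2.262 in 10-fold cross-validation, which is the two-sided 0.05 critical value of the t-distribution with 9 degrees of freedom. A minimal sketch of such a statistic is given below, assuming the Nadeau-Bengio correction factor (1/k + n_test/n_train); the paper may use a different variant, and the fold differences are hypothetical.

```python
import math

def corrected_t(diffs, test_frac):
    """Variance-corrected t-statistic for k paired per-fold differences.
    `test_frac` is n_test/n_train (1/9 for 10-fold CV); test_frac=0
    recovers the ordinary paired t-statistic."""
    k = len(diffs)
    d_bar = sum(diffs) / k
    var = sum((d - d_bar) ** 2 for d in diffs) / (k - 1)   # sample variance
    return d_bar / math.sqrt(var * (1.0 / k + test_frac))

# Hypothetical per-fold accuracy differences from one 10-fold CV run
fold_diffs = [0.01, 0.03, 0.02, 0.00, 0.02, 0.01, 0.04, 0.02, 0.03, 0.02]
t_corr = corrected_t(fold_diffs, test_frac=1 / 9)
```

Because the correction inflates the variance term, the corrected statistic is smaller in magnitude than the naive paired t, which counteracts the over-optimism caused by overlapping training sets across folds.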
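Algorithm 1 in the Pseudocode row bootstraps the standard deviation of a statistic Z_w. Since Z_w itself is not reproduced in this summary, the sketch below applies the same bootstrap idea to a stand-in statistic (the mean accuracy difference over hypothetical per-dataset values); it is not the paper's implementation, which is available in R at the OSF link above.

```python
import random

def bootstrap_sd(values, statistic, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard deviation of `statistic`,
    obtained by resampling `values` with replacement."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        resample = [rng.choice(values) for _ in values]   # resample with replacement
        replicates.append(statistic(resample))
    mean = sum(replicates) / n_boot
    # Sample standard deviation of the bootstrap replicates
    return (sum((r - mean) ** 2 for r in replicates) / (n_boot - 1)) ** 0.5

# Hypothetical accuracy differences between two classifiers on 8 datasets
diffs = [0.02, -0.01, 0.03, 0.00, 0.05, 0.01, -0.02, 0.04]
sd_hat = bootstrap_sd(diffs, lambda xs: sum(xs) / len(xs))
```

For the mean, the bootstrap estimate should land close to the analytic standard error (sample SD divided by the square root of n); the value of the approach is that it applies unchanged to statistics with no closed-form standard error, such as the paper's Z_w.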