Estimating the Replication Probability of Significant Classification Benchmark Experiments

Authors: Daniel Berrar

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using simulation studies, we show that p-values just below the common significance threshold of 0.05 are insufficient to warrant a high confidence in the replicability of significant results, as such p-values are barely more informative than the flip of a coin. If a replication probability of around 0.95 is desired, then the significance threshold should be lowered to at least 0.003. This observation might explain, at least in part, why many published research findings fail to replicate.
Researcher Affiliation | Academia | Daniel Berrar (EMAIL), Machine Learning Research Group, School of Mathematics and Statistics, The Open University, Milton Keynes MK7 6AA, United Kingdom; and Department of Information and Communications Engineering, School of Engineering, Tokyo Institute of Technology, 2-12-1-S3-70 Ookayama, Meguro-ku, Tokyo 152-8550, Japan
Pseudocode | Yes | Algorithm 1: Bootstrapped estimate of the standard deviation of Z_w. Algorithm 2: Comparison of NB and SVM over multiple datasets. Algorithm 3: Comparison of SVM and SVM_o in cross-validation. Algorithm 4: Comparison of SVM and RF in cross-validation.
Open Source Code | Yes | Code and Data Availability: The R code is available at the project website, https://osf.io/7vqfn/.
Open Datasets | Yes | Two classifiers, A and B, are compared on 44 different benchmark datasets from the UCI repository. The difference in performance (for example, accuracy) is assessed based on a suitable significance test.
Dataset Splits | Yes | Two classifiers, A and B, are compared in k-fold cross-validation, and A significantly outperforms B. The replication probability can then be estimated as follows. Example 4. Let us assume that in 10-fold cross-validation, we observe a variance-corrected t-statistic of 2.262, which corresponds to a two-sided p-value of 0.05.
Hardware Specification | Yes | All models and experiments were implemented on a standard PC (Intel Core i7-7700T CPU, 2.90 GHz × 8, 32 GB RAM).
Software Dependencies | Yes | As learning algorithms, we used the naive Bayes (NB) algorithm and the support vector machine (SVM), both implemented with the R library e1071 (Meyer et al., 2022), and random forests (RF), implemented with the R library randomForest (Liaw and Wiener, 2002). ... D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2022. URL https://CRAN.R-project.org/package=e1071. R package version 1.7-11.
Experiment Setup | Yes | The default hyperparameters were used, and no further optimization was performed.
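The coin-flip claim in the Research Type row can be illustrated with a back-of-the-envelope calculation. The sketch below uses the simple normal-approximation estimate of replication probability, assuming the observed effect equals the true effect; the paper's own procedure is built on variance-corrected cross-validation statistics and is more involved, so this is an illustration of the idea rather than the author's method. (The paper's code is in R; Python is used here for convenience.)

```python
from statistics import NormalDist

def replication_probability(p_two_sided, alpha=0.05):
    """Probability that an exact replication is again significant at
    `alpha` (two-sided), assuming a normal test statistic and that the
    observed effect equals the true effect."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1.0 - p_two_sided / 2.0)   # z implied by the observed p-value
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)        # critical z at the threshold
    # Significant again in the same direction, or (rarely) in the opposite one
    return nd.cdf(z_obs - z_crit) + nd.cdf(-z_obs - z_crit)

print(round(replication_probability(0.05), 3))    # -> 0.5
```

With p = 0.05 the estimate is almost exactly 0.5, the coin flip from the abstract; lowering the observed p-value raises it (e.g. `replication_probability(0.003)` is about 0.84 under this simple model), though the paper's 0.95 target is derived from its own cross-validation statistics.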
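Example 4 in the Dataset Splits row cites a variance-corrected t-statistic of 2.262 in 10-fold cross-validation, which is the two-sided 0.05 critical value of the t-distribution with 9 degrees of freedom. A minimal sketch of such a statistic is given below, assuming the Nadeau-Bengio correction factor (1/k + n_test/n_train); the paper may use a different variant, and the fold differences are hypothetical.

```python
import math

def corrected_t(diffs, test_frac):
    """Variance-corrected t-statistic for k paired per-fold differences.
    `test_frac` is n_test/n_train (1/9 for 10-fold CV); test_frac=0
    recovers the ordinary paired t-statistic."""
    k = len(diffs)
    d_bar = sum(diffs) / k
    var = sum((d - d_bar) ** 2 for d in diffs) / (k - 1)   # sample variance
    return d_bar / math.sqrt(var * (1.0 / k + test_frac))

# Hypothetical per-fold accuracy differences from one 10-fold CV run
fold_diffs = [0.01, 0.03, 0.02, 0.00, 0.02, 0.01, 0.04, 0.02, 0.03, 0.02]
t_corr = corrected_t(fold_diffs, test_frac=1 / 9)
```

Because the correction inflates the variance term, the corrected statistic is smaller in magnitude than the naive paired t, which counteracts the over-optimism caused by overlapping training sets across folds.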
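Algorithm 1 in the Pseudocode row bootstraps the standard deviation of a statistic Z_w. Since Z_w itself is not reproduced in this summary, the sketch below applies the same bootstrap idea to a stand-in statistic (the mean accuracy difference over hypothetical per-dataset values); it is not the paper's implementation, which is available in R at the OSF link above.

```python
import random

def bootstrap_sd(values, statistic, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard deviation of `statistic`,
    obtained by resampling `values` with replacement."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        resample = [rng.choice(values) for _ in values]   # resample with replacement
        replicates.append(statistic(resample))
    mean = sum(replicates) / n_boot
    # Sample standard deviation of the bootstrap replicates
    return (sum((r - mean) ** 2 for r in replicates) / (n_boot - 1)) ** 0.5

# Hypothetical accuracy differences between two classifiers on 8 datasets
diffs = [0.02, -0.01, 0.03, 0.00, 0.05, 0.01, -0.02, 0.04]
sd_hat = bootstrap_sd(diffs, lambda xs: sum(xs) / len(xs))
```

For the mean, the bootstrap estimate should land close to the analytic standard error (sample SD divided by the square root of n); the value of the approach is that it applies unchanged to statistics with no closed-form standard error, such as the paper's Z_w.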