A Comparative Evaluation of Quantification Methods
Authors: Tobias Schumacher, Markus Strohmaier, Florian Lemmerich
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we close this research gap by conducting a thorough empirical performance comparison of 24 different quantification methods on in total more than 40 datasets, considering binary as well as multiclass quantification settings. We observe that no single algorithm generally outperforms all competitors, but identify a group of methods that perform best in the binary setting... |
| Researcher Affiliation | Academia | Tobias Schumacher (University of Mannheim, Germany; RWTH Aachen University, Germany); Markus Strohmaier (University of Mannheim, Germany; GESIS Leibniz Institute for the Social Sciences, Germany; Complexity Science Hub, Austria); Florian Lemmerich (University of Passau, Germany) |
| Pseudocode | No | The paper describes algorithms in prose within section "3. Algorithms for Quantification" but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation of the algorithms and experiments can be found on GitHub: https://github.com/tobiasschumacher/quantification_paper |
| Open Datasets | Yes | We applied all algorithms on a broad range of 40 datasets collected from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) and from Kaggle (https://www.kaggle.com/datasets). An overview of these datasets, along with their characteristics and abbreviations that we use when describing our results, is given in Table 2. |
| Dataset Splits | Yes | Regarding training and test distributions, in the binary case, we considered different prevalences of training positives p_+^train and test positives p_+^test in the respective sets, with p_+^train ∈ {0.05, 0.1, 0.3, 0.5, 0.7, 0.9} and p_+^test ∈ {0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, following the protocol introduced by Forman (2008). ... In both binary and multiclass settings, we considered splits with relative amounts of training versus test data samples in {(0.1, 0.9), (0.3, 0.7), (0.5, 0.5), (0.7, 0.3)}, thereby simulating scenarios in which we have little as well as relatively much data at hand to train our models. |
| Hardware Specification | No | The authors acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG. This mentions an HPC resource but does not provide specific hardware details such as CPU or GPU models. |
| Software Dependencies | No | Except for the SVMperf-based quantifiers and quantification forests, all algorithms were implemented from scratch in Python 3, using scikit-learn as base implementation for the underlying classifiers and the package cvxpy (Diamond and Boyd, 2016) to solve constrained optimization problems. The versions for `scikit-learn` and `cvxpy` are not specified. |
| Experiment Setup | Yes | In our main experiments, we chose the following hyperparameters for the quantifiers: As mentioned above, for all methods that use a classifier to perform quantification, we used the logistic regression classifier with the default L-BFGS solver along with its built-in probability estimator provided by scikit-learn and set the number of maximum iterations at 1000. We always used stratified 10-fold cross-validation on the training set when estimating the misclassification rates or computing the set of scores and thresholds that the quantifiers needed. ... For the DyS framework, including the HDy method, we chose to divide its confidence scores into 10 bins... |
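
The sampling protocol quoted in the Dataset Splits row (fixed positive prevalences in the training and test sets, following Forman, 2008) can be sketched as below. This is a minimal illustration, not the paper's code: the function name and the use of sampling with replacement are our assumptions.

```python
import numpy as np

def sample_at_prevalence(y, n_samples, pos_prevalence, rng):
    """Draw indices from a binary-labeled pool (labels in {0, 1}) so that
    a pos_prevalence fraction of the sampled labels is positive."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = int(round(n_samples * pos_prevalence))
    n_neg = n_samples - n_pos
    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=True),
        rng.choice(neg_idx, size=n_neg, replace=True),
    ])
    rng.shuffle(chosen)
    return chosen

rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)           # pool with 20% positives
idx = sample_at_prevalence(y, n_samples=50, pos_prevalence=0.3, rng=rng)
# exactly 15 of the 50 sampled labels are positive
```

Iterating this over the p_+^train and p_+^test grids quoted above reproduces the kind of prevalence-shifted evaluation scenarios the paper describes.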
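
As one concrete instance of the setup described in the Experiment Setup row (logistic regression with `max_iter=1000`, misclassification rates estimated via stratified 10-fold cross-validation), an Adjusted Classify & Count quantifier might look like this. This is a hedged sketch of the general ACC technique, not the authors' implementation; the helper name and the clipping to [0, 1] are our choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def acc_quantify(X_train, y_train, X_test):
    """Adjusted Classify & Count for binary labels in {0, 1}:
    correct the raw positive rate on the test set using the true- and
    false-positive rates estimated by stratified 10-fold CV on the
    training set."""
    clf = LogisticRegression(max_iter=1000)  # default L-BFGS solver
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    y_hat = cross_val_predict(clf, X_train, y_train, cv=cv)
    tpr = float(np.mean(y_hat[y_train == 1] == 1))
    fpr = float(np.mean(y_hat[y_train == 0] == 1))
    clf.fit(X_train, y_train)
    raw = float(np.mean(clf.predict(X_test) == 1))  # plain Classify & Count
    if tpr == fpr:          # adjustment undefined; fall back to raw rate
        return raw
    return float(np.clip((raw - fpr) / (tpr - fpr), 0.0, 1.0))
```

The adjustment divides out the classifier's systematic bias, which is why the CV estimates of the misclassification rates mentioned in the quote are needed before quantifying.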