reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Averaging ROC Curves

Authors: Jack Hogan, Niall M. Adams

TMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	For a simple illustration of the proposed averaging methods, we simulate scores arising from a probabilistic classifier employed on two datasets: the first well separable, the second poorly separable. We assume equally balanced classes for both datasets but allow the total number of instances to differ between datasets. The number of scores simulated for the negative and positive classes of the well separable dataset was n_N1 = n_P1 = 150; for the poorly separable dataset n_N2 = n_P2 = 75. Density estimates of the simulated positive and negative scores for the two datasets are shown in the top two panels of Figure 1a.
Researcher Affiliation	Academia	Jack Hogan EMAIL Department of Mathematics Imperial College London, London, UK. Niall M. Adams EMAIL Department of Mathematics Imperial College London, London, UK.
Pseudocode	No	The paper describes methods and concepts in prose and mathematical notation without explicit pseudocode or algorithm blocks.
Open Source Code	No	The paper discusses third-party software packages (ROCR, scikit-learn) and refers to their online tutorials, but it does not state that the authors are releasing their own code for the methodology described in this paper.
Open Datasets	No	The paper uses simulated scores for illustration, stating: 'For a simple illustration of the proposed averaging methods, we simulate scores arising from a probabilistic classifier employed on two datasets.' There is no mention of using or providing access to any publicly available or open datasets.
Dataset Splits	No	The paper uses simulated scores to illustrate concepts and mentions 'M = 100 datasets simulated in this way' but does not specify training, validation, or test splits for these datasets, as they are generated for illustrative purposes rather than used for model training and evaluation within the paper's own experimental setup.
Hardware Specification	No	The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used for running simulations or computations.
Software Dependencies	No	The paper mentions 'the R package ROCR' and 'the Python package scikit-learn' in a general discussion of existing tools, but it does not specify version numbers for any software dependencies used by the authors to conduct their simulations or analysis.
Experiment Setup	Yes	For a simple illustration of the proposed averaging methods, we simulate scores arising from a probabilistic classifier employed on two datasets: the first well separable, the second poorly separable. We assume equally balanced classes for both datasets but allow the total number of instances to differ between datasets. The number of scores simulated for the negative and positive classes of the well separable dataset was n_N1 = n_P1 = 150; for the poorly separable dataset n_N2 = n_P2 = 75. Density estimates of the simulated positive and negative scores for the two datasets are shown in the top two panels of Figure 1a. The output classification scores for two arbitrary classifiers were simulated as follows: For classifier 1 (C-1), negative and positive class scores for dataset i are simulated from Gaussian distributions N(µ_i0, σ_i0) and N(µ_i1, σ_i1) respectively.
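
The simulation setup quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the class sizes (150 and 75 per class) come from the quoted text, but the Gaussian means and standard deviations are hypothetical placeholders, since the entry does not report them, and the second argument of N(µ, σ) is assumed to be a standard deviation. The helper names `simulate_scores`, `empirical_roc`, and `auc` are likewise invented for this sketch.

```python
import numpy as np


def simulate_scores(rng, n_neg, n_pos, mu0, sigma0, mu1, sigma1):
    """Draw negative and positive class scores from two Gaussians."""
    neg = rng.normal(mu0, sigma0, n_neg)  # negative class scores
    pos = rng.normal(mu1, sigma1, n_pos)  # positive class scores
    scores = np.concatenate([neg, pos])
    labels = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return scores, labels


def empirical_roc(scores, labels):
    """Empirical ROC curve: sweep the threshold from high to low scores.

    Assumes no tied scores, which holds almost surely for continuous draws.
    """
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels)       # true positives at each threshold
    fps = np.cumsum(1 - sorted_labels)   # false positives at each threshold
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


rng = np.random.default_rng(0)

# Class sizes follow the quoted text; all means and sigmas are assumptions.
params = {
    "well separable": dict(n_neg=150, n_pos=150, mu0=0.0, sigma0=1.0, mu1=3.0, sigma1=1.0),
    "poorly separable": dict(n_neg=75, n_pos=75, mu0=0.0, sigma0=1.0, mu1=0.5, sigma1=1.0),
}

results = {}
for name, p in params.items():
    scores, labels = simulate_scores(rng, **p)
    fpr, tpr = empirical_roc(scores, labels)
    results[name] = (fpr, tpr, auc(fpr, tpr))
    print(f"{name}: n = {len(scores)}, AUC = {results[name][2]:.3f}")
```

Each dataset yields its own ROC curve with a different number of points (one per instance plus the origin), which is why combining such curves into a single average requires a choice of averaging scheme.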