reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Metrics of Calibration for Probabilistic Predictions

Authors: Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu

JMLR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This section illustrates the methods of the previous section via analysis of both synthetic and measured data sets. The synthetic examples include the ground-truth known by construction. They first highlight practical problems with the ECEs, then validate the theory of the previous section directly and explicitly. The examples on measured data display even more extreme practical problems with the ECEs, especially in comparison with the ECCEs. Section 3.1 presents the synthetic examples, while Section 3.2 analyzes in detail one of the most popular data sets from computer vision, ImageNet of Russakovsky et al. (2015).
Researcher Affiliation	Industry	Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu; Meta, 1 Facebook Way, Menlo Park, CA 94025, USA
Pseudocode	No	The paper describes mathematical definitions and theoretical proofs, as well as illustrative examples and analyses of synthetic and measured data sets. However, it does not contain any clearly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code	Yes	Permissively licensed open-source software that automatically reproduces all figures and statistics reported below is available at https://github.com/facebookresearch/ecevecce
Open Datasets	Yes	Section 3.2 analyzes in detail one of the most popular data sets from computer vision, ImageNet of Russakovsky et al. (2015).
Dataset Splits	No	The paper uses the "standard training data set ImageNet-1000" with a total of n = 1,281,167 images. While it mentions the use of a training dataset, it does not provide explicit details on how this dataset was further split into training, validation, or test sets for the purpose of their calibration experiments, or refer to predefined splits with specific percentages or counts.
Hardware Specification	No	The paper states that scores were calculated using a "pretrained ResNet18 classifier... from the computer-vision module, torchvision, in the PyTorch software library." However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or perform the analysis.
Software Dependencies	No	The paper mentions using "the PyTorch software library of Paszke et al. (2019)". While PyTorch is named, a specific version number for PyTorch or any other software dependency is not provided.
Experiment Setup	No	The paper describes how scores were generated using a "pretrained ResNet18 classifier" and how responses were obtained (Rk = 1 for correct classification, Rk = 0 otherwise). However, it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or system-level training settings for either the ResNet18 classifier itself or for any other component of their analysis.
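
To make the setup in the table concrete, here is a minimal Python sketch, not taken from the paper's repository. It builds the responses (Rk = 1 for a correct classification, Rk = 0 otherwise) and compares a conventional binned ECE with a cumulative-difference statistic of the kind the ECCEs are built on. The function names, the equal-width binning, the default of 10 bins, and the exact form of the cumulative statistic are all assumptions made for illustration; the paper and https://github.com/facebookresearch/ecevecce hold the authoritative definitions.

```python
import numpy as np


def responses(pred_labels, true_labels):
    """Return R_k = 1.0 where the predicted class equals the true class, else 0.0."""
    return (np.asarray(pred_labels) == np.asarray(true_labels)).astype(float)


def binned_ece(scores, resp, n_bins=10):
    """Conventional ECE with equal-width bins on [0, 1] (binning is an assumption).

    Sums, over non-empty bins, the bin's share of the data times
    |mean response - mean score| within that bin.
    """
    scores = np.asarray(scores, dtype=float)
    resp = np.asarray(resp, dtype=float)
    # Map each score to a bin index; a score of exactly 1.0 goes in the last bin.
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(resp[mask].mean() - scores[mask].mean())
    return ece


def cumulative_max_abs_deviation(scores, resp):
    """Maximum absolute cumulative difference between responses and scores.

    Assumed form: sort by score, accumulate (R_k - S_k) / n, and take the
    largest absolute partial sum. No bins are needed.
    """
    scores = np.asarray(scores, dtype=float)
    resp = np.asarray(resp, dtype=float)
    order = np.argsort(scores, kind="stable")
    partial_sums = np.cumsum((resp[order] - scores[order]) / len(scores))
    return np.abs(partial_sums).max()


if __name__ == "__main__":
    # Toy data (hypothetical): two predictions, the first correct, the second wrong.
    s = [0.25, 0.75]
    r = responses([0, 1], [0, 0])
    print(binned_ece(s, r, n_bins=2), cumulative_max_abs_deviation(s, r))
```

The binned estimate depends on the choice of n_bins, while the cumulative statistic has no such parameter. That difference is the kind of practical issue the Research Type row refers to when it mentions problems with the ECEs.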