Metrics of Calibration for Probabilistic Predictions

Authors: Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section illustrates the methods of the previous section via analysis of both synthetic and measured data sets. The synthetic examples include the ground-truth known by construction. They first highlight practical problems with the ECEs, then validate the theory of the previous section directly and explicitly. The examples on measured data display even more extreme practical problems with the ECEs, especially in comparison with the ECCEs. Section 3.1 presents the synthetic examples, while Section 3.2 analyzes in detail one of the most popular data sets from computer vision, ImageNet of Russakovsky et al. (2015).
Researcher Affiliation | Industry | Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu; Meta, 1 Facebook Way, Menlo Park, CA 94025, USA
Pseudocode | No | The paper describes mathematical definitions and theoretical proofs, as well as illustrative examples and analyses of synthetic and measured data sets. However, it does not contain any clearly labeled pseudocode blocks or algorithms in a structured, code-like format.
Open Source Code | Yes | Permissively licensed open-source software that automatically reproduces all figures and statistics reported below is available at https://github.com/facebookresearch/ecevecce
Open Datasets | Yes | Section 3.2 analyzes in detail one of the most popular data sets from computer vision, ImageNet of Russakovsky et al. (2015).
Dataset Splits | No | The paper uses the "standard training data set ImageNet-1000" with a total of n = 1,281,167 images. It does not explain how this data set was further split into training, validation, or test sets for the calibration experiments, nor does it refer to predefined splits with specific percentages or counts.
Hardware Specification | No | The paper states that scores were calculated using a "pretrained ResNet18 classifier... from the computer-vision module, torchvision, in the PyTorch software library." However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or perform the analysis.
Software Dependencies | No | The paper mentions using "the PyTorch software library of Paszke et al. (2019)". While PyTorch is named, no version number is given for PyTorch or for any other software dependency.
Experiment Setup | No | The paper describes how scores were generated using a "pretrained ResNet18 classifier" and how responses were obtained (R_k = 1 for correct classification, R_k = 0 otherwise). However, it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or system-level training settings for either the ResNet18 classifier itself or for any other component of the analysis.
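
The table contrasts ECEs with ECCEs but does not define either. The following is a minimal sketch of how such metrics are commonly computed from scores S_k and binary responses R_k, not the authors' implementation (their code is at the repository linked above). It assumes equal-width bins for the binned error, and it assumes the cumulative metrics are the maximum absolute deviation and the range of the cumulative sums of (R_k - S_k)/n taken in order of increasing score, in the spirit of the Kolmogorov-Smirnov and Kuiper statistics. The function names `ece_binned`, `cumulative_differences`, `ecce_mad`, and `ecce_range` are hypothetical.

```python
import random


def ece_binned(scores, responses, n_bins=10):
    """Binned l1 calibration error with equal-width bins on [0, 1] (assumed form)."""
    n = len(scores)
    sum_s = [0.0] * n_bins
    sum_r = [0.0] * n_bins
    for s, r in zip(scores, responses):
        b = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        sum_s[b] += s
        sum_r[b] += r
    # Weighting each bin's |mean response - mean score| by its share of the
    # samples reduces to |sum of responses - sum of scores| / n per bin.
    return sum(abs(sr - ss) for sr, ss in zip(sum_r, sum_s)) / n


def cumulative_differences(scores, responses):
    """Cumulative sums of (R_k - S_k) / n in order of increasing score, starting at 0."""
    n = len(scores)
    order = sorted(range(n), key=lambda k: scores[k])
    total = 0.0
    cums = [0.0]
    for k in order:
        total += (responses[k] - scores[k]) / n
        cums.append(total)
    return cums


def ecce_mad(scores, responses):
    """Maximum absolute deviation of the cumulative differences (assumed form)."""
    return max(abs(c) for c in cumulative_differences(scores, responses))


def ecce_range(scores, responses):
    """Range (max minus min) of the cumulative differences (assumed form)."""
    cums = cumulative_differences(scores, responses)
    return max(cums) - min(cums)


if __name__ == "__main__":
    # Synthetic, calibrated-by-construction data: R_k ~ Bernoulli(S_k).
    rng = random.Random(0)
    scores = [rng.random() for _ in range(10000)]
    responses = [1 if rng.random() < s else 0 for s in scores]
    for n_bins in (10, 100, 1000):
        print("ECE with", n_bins, "bins:", ece_binned(scores, responses, n_bins))
    print("ECCE-MAD:", ecce_mad(scores, responses))
    print("ECCE-R:", ecce_range(scores, responses))
```

Running the script on the synthetic data shows how the binned estimate varies with the number of bins, a parameter the cumulative metrics do not have.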