Improving Predictor Reliability with Selective Recalibration
Authors: Thomas P. Zollo, Zhun Deng, Jake C. Snell, Toniann Pitassi, Richard Zemel
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical analysis to motivate our algorithm, and test our method through comprehensive experiments on difficult medical imaging and zero-shot classification tasks. Our results show that selective recalibration consistently leads to significantly lower calibration error than a wide range of selection and recalibration baselines. |
| Researcher Affiliation | Academia | Thomas P. Zollo (Columbia University); Zhun Deng (Columbia University); Jake C. Snell (Princeton University); Toniann Pitassi (Columbia University); Richard Zemel (Columbia University) |
| Pseudocode | No | The paper describes the proposed method, Selective Recalibration, in Section 4, detailing components like Selection Loss (4.1), Coverage Loss (4.2), and Recalibration Models (4.3). However, these descriptions are provided in prose and mathematical formulations within the main text, rather than in explicitly labeled, structured pseudocode or algorithm blocks. |
| Open Source Code | No | Recalibration model code is taken from the accompanying code releases of Guo et al. (2017) (Temperature Scaling) and Kumar et al. (2019) (Platt Scaling, Histogram Binning, Platt Binning). While these external codebases are referenced, the paper does not provide an explicit statement or link for source code implementing the authors' proposed selective recalibration method itself. |
| Open Datasets | Yes | We test selective recalibration and S-TLBCE in real-world medical diagnosis and image classification experiments... Camelyon17 (Bandi et al., 2018) is a task where the input x is a 96x96 patch... ImageNet is a well-known large-scale image classification dataset... RxRx1 (Taylor et al., 2019) is a task where the input x is a 3-channel image... CIFAR-100 is a well-known image classification dataset, and we perform zero-shot image classification with CLIP. |
| Dataset Splits | Yes | The validation set has 34,904 examples and accuracy of 91%, while the test set has 84,054 examples and accuracy of 83%. (Camelyon17, A.3.1) We extract the features, scores, and labels from the 50,000 ImageNet validation samples... (ImageNet, A.3.2) The validation set has 9,854 examples and accuracy of 18%, while the test set has 34,432 examples and accuracy of 27%. (RxRx1, A.4.1) We draw 2,000 samples for model training, and test on 50,000 examples drawn from the 750,000 examples in CIFAR-100-C. (CIFAR-100, A.4.2) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for running the experiments. It discusses computational aspects in terms of training models and datasets but does not specify the underlying hardware infrastructure. |
| Software Dependencies | No | Recalibration model code is taken from the accompanying code releases of Guo et al. (2017) (Temperature Scaling) and Kumar et al. (2019) (Platt Scaling, Histogram Binning, Platt Binning). We calculate ECE_q for q ∈ {1, 2} using the Python library released by Kumar et al. (2019). The scikit-learn Python library (Pedregosa et al., 2011) is used to produce the One-Class SVM and Isolation Forest models. We pre-train a DenseNet-121 model on the Camelyon17 train set using the code from Koh et al. (2021). We extract the features, scores, and labels from the 50,000 ImageNet validation samples using a pre-trained ResNet34 model from the torchvision library. Data augmentation in training is performed using AugMix (Hendrycks et al., 2020). While several libraries and models are mentioned, specific version numbers for Python, scikit-learn, torchvision, or other key software components are not provided. |
| Experiment Setup | Yes | Our selector g is a shallow fully-connected network (2 hidden layers with dimension 128)... trained with a learning rate of 0.0005, the coverage loss weight λ is set to 32... and the model is trained with 1000 samples for 1000 epochs with a batch size of 100. (Camelyon17, A.3.1) Our selector g is trained with a learning rate of 0.00001, the coverage loss weight λ is set to 32... and the model is trained with 2000 samples for 1000 epochs with a batch size of 200. (ImageNet, A.3.2) Our selector g is a shallow fully-connected network (1 hidden layer with dimension 64 and ReLU activation) trained with a learning rate of 0.0001, the coverage loss weight λ is set to 8, and the model is trained for 50 epochs... with a batch size of 256. (RxRx1, A.4) |
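The ECE_q metric mentioned under Software Dependencies can be illustrated with a short sketch. This is not the Kumar et al. (2019) library implementation the paper uses; it is a minimal version assuming equal-width confidence bins and binary top-label correctness flags:

```python
def ece(confidences, correct, n_bins=15, q=1):
    """Binned expected calibration error (ECE_q).

    confidences: predicted top-class probabilities in [0, 1]
    correct: 1/0 flags for whether each prediction was right
    q: norm order (q=1 gives standard ECE, q=2 an L2 variant)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # assign each prediction to an equal-width confidence bin
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    total = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(y for _, y in members) / len(members)
        # bin weight times the confidence/accuracy gap, raised to q
        total += (len(members) / n) * abs(avg_conf - avg_acc) ** q
    return total ** (1.0 / q)

# A predictor that is 90% accurate at confidence 0.9 is perfectly calibrated.
conf = [0.9] * 10
hits = [1] * 9 + [0]
print(round(ece(conf, hits), 6))  # → 0.0
```

Selective recalibration aims to drive exactly this quantity down on the subset of examples the selector accepts.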
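Likewise, the Temperature Scaling recalibrator attributed to Guo et al. (2017) can be sketched. The paper uses the authors' code release, which fits the temperature T by LBFGS; this illustration instead fits T by a simple grid search over candidate values, which is an assumption made here for self-containedness:

```python
import math

def softmax(logits, T=1.0):
    # temperature-scaled softmax over one example's logits
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logits_batch, labels, T):
    # average negative log-likelihood at temperature T
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        total -= math.log(softmax(logits, T)[y])
    return total / len(labels)

def fit_temperature(logits_batch, labels):
    # grid search over T in [0.5, 5.0]; Guo et al. use LBFGS instead
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))

# Overconfident model: confidence ~0.98 at T=1, but only 75% accurate.
logits = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]
T = fit_temperature(logits, labels)
# After scaling by the fitted T > 1, confidence drops to
# roughly match the 75% empirical accuracy.
print(round(softmax(logits[0], T)[0], 3))
```

A single scalar T is fit on held-out validation data and leaves the model's predicted class unchanged, since dividing all logits by T preserves their ordering.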