Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provable Uncertainty Decomposition via Higher-Order Calibration

Authors: Gustaf Ahdritz, Aravind Gollakota, Parikshit Gopalan, Charlotte Peale, Udi Wieder

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We demonstrate through experiments that our method produces meaningful uncertainty decompositions for image classification. ... We verify that our methods yield useful decompositions in real image classification tasks." |
| Researcher Affiliation | Collaboration | Gustaf Ahdritz (Harvard University), Aravind Gollakota (Apple), Parikshit Gopalan (Apple), Charlotte Peale (Stanford University), Udi Wieder (Apple) |
| Pseudocode | No | The paper describes its methods in Section 2.2, "Achieving kth-order calibration," but does not present them in a structured pseudocode or algorithm block. It mentions "Algorithm 2" in Appendix G.4, but this refers to an algorithm from an external paper, not one explicitly presented within this document. |
| Open Source Code | No | The paper mentions using "the Python implementation of the network from the uncertainty baselines package" and the "enn Python library," which are third-party tools, but does not provide concrete access information (a link or an explicit statement) for source code implementing its own methodology. |
| Open Datasets | Yes | "We focus on the task of classifying ambiguous images using CIFAR-10H (Peterson et al., 2019), a relabeling of the test set of CIFAR-10 (Krizhevsky, 2009). ... we also compare our algorithms on the FER+ (Barsoum et al., 2016) dataset." |
| Dataset Splits | Yes | "We first train a regular 1-snapshot ResNet ... on 45,000 images from the CIFAR training set, setting aside the remaining 5,000 for validation. We then apply our post-hoc calibration algorithm ... to CIFAR-10H, using half as a calibration set and the other half as a test set. ... For the post-hoc calibration algorithm, the calibration set is composed of half of the test set. ... For the FER+ dataset, we use a further 80/10/10 train/val/test split of the dataset." |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications; it mentions only training a "wide ResNet" and a "neural network." |
| Software Dependencies | No | The paper mentions the "AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2017)," the "uncertainty baselines package (Nado et al., 2021)," and the "enn Python library (Osband et al., 2023)," but no version numbers are provided for these software dependencies or for the programming language. |
| Experiment Setup | Yes | "We use a learning rate of 3.799e-3 and relatively large weight decay of 3.656e-1. We train the model for 50 epochs with the AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2017). For the first epoch, we warm up the learning rate and apply cosine decay thereafter. All models are trained using AugMix data augmentation (Hendrycks et al., 2020) with the standard hyperparameters used in the uncertainty baselines package." |
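The learning-rate schedule quoted in the Experiment Setup row (warmup for the first epoch, cosine decay thereafter) can be sketched as a small standalone function. This is a hypothetical reconstruction, not the authors' code: the paper's quote does not specify the warmup shape or the decay floor, so linear warmup and decay to zero are assumptions here.

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 3.799e-3       # learning rate
WEIGHT_DECAY = 3.656e-1  # "relatively large weight decay" for AdamW
EPOCHS = 50
WARMUP_EPOCHS = 1        # "For the first epoch, we warm up the learning rate"

def learning_rate(epoch: float) -> float:
    """Learning rate at a (fractional) epoch in [0, EPOCHS].

    Assumed shape: linear warmup from 0 to PEAK_LR over the first
    epoch, then cosine decay from PEAK_LR down to 0 at epoch EPOCHS.
    """
    if epoch < WARMUP_EPOCHS:
        return PEAK_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a training loop this function would be evaluated once per step (with `epoch = step / steps_per_epoch`) and assigned to the optimizer's learning rate; `WEIGHT_DECAY` would be passed to AdamW separately, since decoupled weight decay is not scaled by this schedule.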