A Note on "Assessing Generalization of SGD via Disagreement"
Authors: Andreas Kirsch, Yarin Gal
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this reproduction, we show that the suggested theory might be impractical because a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, while labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis space view employed by Jiang et al. (2022). We also draw connections and show that the class-aggregated calibration error and the class-wise calibration error are equivalent to the adaptive calibration error and static calibration error introduced in Nixon et al. (2019) and its implementation. Finally, in Section 5, we provide empirical evidence that deep ensembles are less calibrated exactly when their ensemble members disagree. |
| Researcher Affiliation | Academia | Andreas Kirsch EMAIL Yarin Gal EMAIL OATML, Department of Computer Science University of Oxford |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode blocks or algorithm sections. Methodologies are described through text and mathematical formulations. |
| Open Source Code | Yes | 1Code at https://github.com/BlackHC/2202.01851 |
| Open Datasets | Yes | In our empirical falsification using models trained on CIFAR-10 (Krizhevsky et al., 2009) and evaluated on the test sets of CIFAR-10 and CINIC-10, as a dataset with a distribution shift, we find in both cases that calibration deteriorates under increasing disagreement. We further examine ImageNet and PACS in appendix E.2. Most importantly though, calibration markedly worsens under distribution shift. Specifically, we examine an ensemble of 25 WideResNet models (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 (Krizhevsky et al., 2009) and evaluated on CIFAR-10 and CINIC-10 test data. CINIC-10 (Darlow et al., 2018) consists of CIFAR-10 and downscaled ImageNet samples for the same classes, and thus includes a distribution shift. We also observe the same for ImageNet (Deng et al., 2009) and PACS (Li et al., 2017), which we show in appendix E.2. |
| Dataset Splits | Yes | In our empirical falsification using models trained on CIFAR-10 and evaluated on the test sets of CIFAR-10 and CINIC-10, as a dataset with a distribution shift, we find in both cases that calibration deteriorates under increasing disagreement. We use pretrained models with various architectures (...) which we fine-tune on the PACS photo domain (...) and evaluate on the PACS art painting, sketch, and cartoon domains. |
| Hardware Specification | No | The paper describes experimental setups, including model architectures and training parameters, but does not specify the hardware used (e.g., GPU, CPU models, or memory). Phrases like 'models trained on' or 'fine-tune' do not convey specific hardware details. |
| Software Dependencies | No | We use PyTorch (Paszke et al., 2019) for all experiments. (...) from the timm package (Wightman, 2019) as base models. |
| Experiment Setup | Yes | CIFAR-10 and CINIC-10. We follow the training setup from Mukhoti et al. (2021): we train 25 WideResNet-28-10 models (Zagoruyko & Komodakis, 2016) for 350 epochs on CIFAR-10. We use SGD with a learning rate of 0.1 and momentum of 0.9. We use a learning rate schedule with a decay of 10 at 150 and 250 epochs. ImageNet and PACS. (...) We freeze all weights except for the final linear layer, which we fine-tune on the PACS photo domain using Adam (Kingma & Ba, 2014) with learning rate 5e-3 and batch size 128 for 1000 steps. |
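The two quantities at the heart of the paper's argument are the disagreement rate between ensemble members and a calibration error of the ensemble's predictions. As a minimal sketch (not the authors' released code, which is at the repository linked above), the following computes pairwise disagreement and standard top-label expected calibration error (ECE); note the paper's analysis centers on class-aggregated and class-wise calibration variants, for which ECE stands in here only as the simplest illustration:

```python
import numpy as np

def disagreement_rate(logits_a, logits_b):
    """Fraction of inputs on which two models' argmax predictions differ."""
    return float(np.mean(np.argmax(logits_a, axis=1) != np.argmax(logits_b, axis=1)))

def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: bin predictions by confidence, then average the
    per-bin |accuracy - mean confidence| weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return float(ece)
```

The paper's point can be probed by binning test inputs by ensemble disagreement and computing the calibration error within each bin: if calibration worsens in high-disagreement bins, the disagreement-based generalization estimate is least reliable exactly where it matters.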