A Note on "Assessing Generalization of SGD via Disagreement"
Authors: Andreas Kirsch, Yarin Gal
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this reproduction, we show that the suggested theory might be impractical because a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, while labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis space view employed by Jiang et al. (2022). We also draw connections and show that the class-aggregated calibration error and the class-wise calibration error are equivalent to the adaptive calibration error and static calibration error introduced in Nixon et al. (2019) and its implementation. Finally, in Section 5, we provide empirical evidence that deep ensembles are less calibrated exactly when their ensemble members disagree. |
| Researcher Affiliation | Academia | Andreas Kirsch EMAIL Yarin Gal EMAIL OATML, Department of Computer Science University of Oxford |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode blocks or algorithm sections. Methodologies are described through text and mathematical formulations. |
| Open Source Code | Yes | 1Code at https://github.com/BlackHC/2202.01851 |
| Open Datasets | Yes | In our empirical falsification using models trained on CIFAR-10 (Krizhevsky et al., 2009) and evaluated on the test sets of CIFAR-10 and CINIC-10, as a dataset with a distribution shift, we find in both cases that calibration deteriorates under increasing disagreement. We further examine ImageNet and PACS in appendix E.2. Most importantly though, calibration markedly worsens under distribution shift. Specifically, we examine an ensemble of 25 WideResNet models (Zagoruyko & Komodakis, 2016) trained on CIFAR-10 (Krizhevsky et al., 2009) and evaluated on CIFAR-10 and CINIC-10 test data. CINIC-10 (Darlow et al., 2018) consists of CIFAR-10 and downscaled ImageNet samples for the same classes, and thus includes a distribution shift. We also observe the same for ImageNet (Deng et al., 2009) and PACS (Li et al., 2017), which we show in appendix E.2. |
| Dataset Splits | Yes | In our empirical falsification using models trained on CIFAR-10 and evaluated on the test sets of CIFAR-10 and CINIC-10, as a dataset with a distribution shift, we find in both cases that calibration deteriorates under increasing disagreement. We use pretrained models with various architectures (...) which we fine-tune on the PACS photo domain (...) and evaluate on the PACS art painting, sketch, and cartoon domains. |
| Hardware Specification | No | The paper describes experimental setups, including model architectures and training parameters, but does not specify the hardware used (e.g., GPU, CPU models, or memory). Phrases like 'models trained on' or 'fine-tune' do not convey specific hardware details. |
| Software Dependencies | No | We use PyTorch (Paszke et al., 2019) for all experiments. (...) from the timm package (Wightman, 2019) as base models. |
| Experiment Setup | Yes | CIFAR-10 and CINIC-10. We follow the training setup from Mukhoti et al. (2021): we train 25 WideResNet-28-10 models (Zagoruyko & Komodakis, 2016) for 350 epochs on CIFAR-10. We use SGD with a learning rate of 0.1 and momentum of 0.9. We use a learning rate schedule with a decay of 10 at 150 and 250 epochs. ImageNet and PACS. (...) We freeze all weights except for the final linear layer, which we fine-tune on the PACS photo domain using Adam (Kingma & Ba, 2014) with learning rate 5e-3 and batch size 128 for 1000 steps. |
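The two quantities at the heart of the paper's argument are the disagreement rate between ensemble members and a calibration error of the ensemble's predictions. As a minimal sketch (not the authors' released code, which is at the repository linked above), the following computes pairwise disagreement and standard top-label expected calibration error (ECE); note the paper's analysis centers on class-aggregated and class-wise calibration variants, for which ECE stands in here only as the simplest illustration:

```python
import numpy as np

def disagreement_rate(logits_a, logits_b):
    """Fraction of inputs on which two models' argmax predictions differ."""
    return float(np.mean(np.argmax(logits_a, axis=1) != np.argmax(logits_b, axis=1)))

def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: bin predictions by confidence, then average the
    per-bin |accuracy - mean confidence| weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return float(ece)
```

The paper's point can be probed by binning test inputs by ensemble disagreement and computing the calibration error within each bin: if calibration worsens in high-disagreement bins, the disagreement-based generalization estimate is least reliable exactly where it matters.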