Normalization Layers Are All That Sharpness-Aware Minimization Needs
Authors: Maximilian Mueller, Tiffany Vlaar, David Rolnick, Matthias Hein
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the effect of SAM-ON, i.e. only applying SAM to the BatchNorm parameters, for a WideResNet-28-10 (WRN-28) on CIFAR-100 in Figure 1. We observe that SAM-ON obtains higher accuracy than conventional SAM (SAM-all) for all SAM variants considered (more SAM variants are shown in Figure 6 in the Appendix). ... We report mean accuracy and standard deviation over 3 seeds for CIFAR-100 in Table 1. A hedged sketch of the SAM-ON update step appears after the table. |
| Researcher Affiliation | Academia | Maximilian Müller (University of Tübingen and Tübingen AI Center), Tiffany Vlaar (McGill University and Mila Quebec AI Institute), David Rolnick (McGill University and Mila Quebec AI Institute), Matthias Hein (University of Tübingen and Tübingen AI Center) |
| Pseudocode | No | No structured pseudocode or algorithm blocks with labels like 'Pseudocode' or 'Algorithm' were found. |
| Open Source Code | Yes | Code is provided at https://github.com/mueller-mp/SAM-ON. |
| Open Datasets | Yes | We showcase the effect of SAM-ON...on CIFAR-100 in Figure 1. |
| Dataset Splits | No | No explicit statement providing specific training/validation/test dataset splits (e.g., percentages or sample counts for a validation set) was found. The paper primarily discusses training and test phases, e.g., 'We train models for 200 epochs', and reports 'Test Accuracy (%)'. |
| Hardware Specification | Yes | We train a ResNet-50 for 100 epochs on eight 2080-Ti GPUs with m = 64, leading to an overall batch size of 512. |
| Software Dependencies | No | The paper mentions 'PyTorch' and refers to the 'timm training script [53]' but does not provide specific version numbers for these or other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For ResNets, we follow [37] and adopt a learning rate of 0.1, momentum of 0.9, weight decay of 0.0005, and use label smoothing with a factor of 0.1. These hyperparameters are mirrored in the optimizer sketch after the table. |
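
The Research Type row quotes the paper's central idea: SAM-ON applies the SAM perturbation only to the parameters of normalization layers, while the descent step updates the whole network. Below is a minimal PyTorch sketch of one such update, assuming the standard two-forward-pass SAM loop; the selection of norm parameters by module type, the `rho` default, and the function names are illustrative assumptions, not the authors' exact implementation (their code is at https://github.com/mueller-mp/SAM-ON).

```python
import torch
import torch.nn as nn

# Normalization layers whose affine parameters receive the SAM perturbation.
# Illustrative choice; the authors' repository is the authoritative selection.
NORM_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
              nn.LayerNorm, nn.GroupNorm)

def norm_parameters(model):
    """Yield only the affine parameters of normalization layers."""
    for module in model.modules():
        if isinstance(module, NORM_TYPES):
            yield from module.parameters(recurse=False)

def sam_on_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """One SAM-ON update: the ascent (epsilon) step perturbs only the
    norm-layer parameters; the descent step updates all parameters."""
    params = [p for p in norm_parameters(model) if p.requires_grad]

    # First forward/backward pass: gradients at the current weights.
    base_optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Ascent: move norm parameters toward the approximate worst case within
    # an L2 ball of radius rho, as in standard SAM but restricted to `params`.
    with torch.no_grad():
        grads = [p.grad for p in params if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]))
        scale = rho / (grad_norm + 1e-12)
        eps = []
        for p in params:
            e = p.grad * scale if p.grad is not None else torch.zeros_like(p)
            p.add_(e)
            eps.append(e)

    # Second forward/backward pass at the perturbed point.
    base_optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then take the descent step on all parameters.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

Conventional SAM (SAM-all, in the paper's terminology) would differ only in that `params` covers all trainable parameters rather than the norm-layer subset.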
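
The Experiment Setup row lists concrete hyperparameters. As a sanity check, here is how they map onto a PyTorch optimizer and loss. The stand-in model and the absence of a learning-rate schedule are assumptions for illustration; the paper trains WideResNets/ResNets as defined in the authors' repository.

```python
import torch
import torch.nn as nn

# Stand-in model for illustration only; the paper trains WRN/ResNet variants.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))

# Base optimizer with the hyperparameters quoted in the setup row.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # learning rate 0.1
    momentum=0.9,       # momentum 0.9
    weight_decay=5e-4,  # weight decay 0.0005
)

# Label smoothing with a factor of 0.1 (requires PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

This `optimizer` would serve as the `base_optimizer` in the SAM-ON sketch above, with `criterion` as the `loss_fn`.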