Normalization Layers Are All That Sharpness-Aware Minimization Needs

Authors: Maximilian Mueller, Tiffany Vlaar, David Rolnick, Matthias Hein

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We showcase the effect of SAM-ON, i.e. only applying SAM to the batch norm parameters, for a WideResNet-28-10 (WRN-28) on CIFAR-100 in Figure 1. We observe that SAM-ON obtains higher accuracy than conventional SAM (SAM-all) for all SAM variants considered (more SAM variants are shown in Figure 6 in the Appendix). ... We report mean accuracy and standard deviation over 3 seeds for CIFAR-100 in Table 1."
Researcher Affiliation | Academia | Maximilian Müller (University of Tübingen and Tübingen AI Center, EMAIL); Tiffany Vlaar (McGill University and Mila Quebec AI Institute, EMAIL); David Rolnick (McGill University and Mila Quebec AI Institute, EMAIL); Matthias Hein (University of Tübingen and Tübingen AI Center, EMAIL)
Pseudocode | No | No structured pseudocode or algorithm blocks labeled "Pseudocode" or "Algorithm" were found.
Open Source Code | Yes | Code is provided at https://github.com/mueller-mp/SAM-ON.
Open Datasets | Yes | "We showcase the effect of SAM-ON ... on CIFAR-100 in Figure 1."
Dataset Splits | No | No explicit statement of specific training/validation/test dataset splits (e.g., percentages or sample counts for a validation set) was found. The paper primarily discusses training and test phases, e.g., "We train models for 200 epochs" and reports "Test Accuracy (%)."
Hardware Specification | Yes | "We train a ResNet-50 for 100 epochs on eight 2080-Ti GPUs with m = 64, leading to an overall batch-size of 512."
Software Dependencies | No | The paper mentions "PyTorch" and refers to the "timm training script [53]" but does not provide version numbers for these or other software dependencies, which are needed for full reproducibility.
Experiment Setup | Yes | "For ResNets, we follow [37] and adopt a learning rate of 0.1, momentum of 0.9, weight decay of 0.0005 and use label smoothing with a factor of 0.1."
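The technique evaluated above, SAM-ON, applies the SAM ascent perturbation only to normalization-layer parameters while every parameter still takes the descent step at the perturbed point. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: the parameter names (`bn_gamma`, `bn_beta`), the `grad_fn` interface, and the default hyperparameters are assumptions for the example.

```python
import numpy as np

def sam_on_step(params, grad_fn, lr=0.1, rho=0.05,
                norm_keys=("bn_gamma", "bn_beta")):
    """One SAM-ON update (sketch).

    params:   dict mapping parameter names to numpy arrays
    grad_fn:  callable returning a dict of gradients for given params
    rho:      SAM perturbation radius; the ascent step touches only
              the parameters listed in norm_keys
    """
    grads = grad_fn(params)
    # Gradient norm restricted to the normalization parameters.
    g_norm = np.sqrt(sum(np.sum(grads[k] ** 2) for k in norm_keys)) + 1e-12
    # Ascent step: perturb only norm-layer parameters toward higher loss.
    perturbed = {k: v + (rho / g_norm) * grads[k] if k in norm_keys else v
                 for k, v in params.items()}
    # Gradient at the perturbed point drives the descent for ALL parameters.
    new_grads = grad_fn(perturbed)
    return {k: v - lr * new_grads[k] for k, v in params.items()}
```

In a real PyTorch training loop the same selection would typically be done by filtering `model.named_parameters()` for normalization layers before the ascent step; the authors' actual code is at the repository linked in the table.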
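The hardware and setup rows above can be collected into a single configuration sketch. The numeric values are quoted from the report; the dictionary layout itself is illustrative, not the authors' config format.

```python
# ResNet recipe quoted in the "Experiment Setup" row (following [37]).
resnet_recipe = {
    "lr": 0.1,               # learning rate
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "label_smoothing": 0.1,
}

# "Hardware Specification" row: eight 2080-Ti GPUs with per-GPU batch m = 64.
n_gpus, m = 8, 64
overall_batch_size = n_gpus * m  # 512, matching the reported overall batch size
```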