Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late In Training

Authors: Zhanpeng Zhou, Mingze Wang, Yuchen Mao, Bingrui Li, Junchi Yan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are conducted using SAM in Equation (3), whereas our theoretical analyses in Section 4 apply the simplified SAM in Equation (4). Specifically, we perform experiments on the commonly used image classification datasets CIFAR-10/100 (Krizhevsky et al., 2009), with standard architectures such as WideResNet (Zagoruyko & Komodakis, 2016), ResNet (He et al., 2016), and VGG (Simonyan & Zisserman, 2015).
Researcher Affiliation | Academia | 1 Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University; 2 Peking University; 3 Tsinghua University; 4 Shanghai Artificial Intelligence Laboratory
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided; the paper describes its methods through mathematical equations and text.
Open Source Code | Yes | We released our source code at https://github.com/zzp1012/SAM-in-Late-Phase.
Open Datasets | Yes | We perform experiments on the commonly used image classification datasets CIFAR-10/100 (Krizhevsky et al., 2009)
Dataset Splits | Yes | We perform experiments on the commonly used image classification datasets CIFAR-10/100 (Krizhevsky et al., 2009), with standard architectures such as WideResNet (Zagoruyko & Komodakis, 2016), ResNet (He et al., 2016), and VGG (Simonyan & Zisserman, 2015). We use the standard configurations for the basic training settings shared by SAM and SGD (e.g., learning rate, batch size, and data augmentation) as in the original papers.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or processor types) are mentioned in the paper.
Software Dependencies | No | The paper mentions 'an implementation limitation in PyTorch' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | We use the standard configurations for the basic training settings shared by SAM and SGD (e.g., learning rate, batch size, and data augmentation) as in the original papers, and set the SAM-specific perturbation radius ρ to 0.05, as recommended by Foret et al. (2021). ... A weight decay of 5 * 10^-4 is applied, and the momentum for gradient update is set to 0.9. The learning rate is initialized at 0.1 and is dropped by 10 times at epoch 80. The total number of training epochs is 160.
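For reference, the SAM update the quoted setup describes (ascend to the worst-case point inside a ρ-ball around the weights, then take the descent step using the gradient evaluated there) can be sketched with the quoted hyperparameters (ρ = 0.05, lr = 0.1, momentum = 0.9, weight decay 5e-4). This is a minimal NumPy illustration on a toy quadratic loss, not the paper's implementation: the actual experiments use PyTorch with CIFAR-scale models, and the toy loss and variable names here are assumptions for demonstration only.

```python
import numpy as np

# Hyperparameters quoted in the experiment setup above.
RHO, LR, MOMENTUM, WEIGHT_DECAY = 0.05, 0.1, 0.9, 5e-4

# Toy ill-conditioned quadratic loss (illustrative stand-in for the
# training loss; the paper trains WideResNet/ResNet/VGG on CIFAR-10/100).
CURVATURE = np.array([1.0, 10.0])

def loss(w):
    return 0.5 * np.dot(w, CURVATURE * w)

def grad(w):
    return CURVATURE * w

def sam_step(w, velocity):
    """One SAM step with SGD-style momentum and L2 weight decay."""
    g = grad(w)
    # Ascent step: move to the (first-order) worst-case point on the rho-ball.
    eps = RHO * g / (np.linalg.norm(g) + 1e-12)
    # Descent gradient is evaluated at the perturbed weights,
    # with L2 weight decay added to the gradient.
    g_sam = grad(w + eps) + WEIGHT_DECAY * w
    velocity = MOMENTUM * velocity + g_sam
    return w - LR * velocity, velocity

w0 = np.array([1.0, 1.0])
w, v = w0.copy(), np.zeros_like(w0)
for _ in range(50):
    w, v = sam_step(w, v)
print(round(float(loss(w)), 4))  # final loss, well below the initial 5.5
```

Note that each SAM step costs two gradient evaluations (one at `w`, one at `w + eps`), which is why SAM roughly doubles per-epoch compute relative to SGD.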