Sharpness-Aware Minimization and the Edge of Stability

Authors: Philip M. Long, Peter L. Bartlett

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using three deep learning training tasks, we see empirically that SAM operates on the edge of stability identified by this analysis.
Researcher Affiliation | Collaboration | Philip M. Long (EMAIL) and Peter L. Bartlett (EMAIL), Google, 1600 Amphitheatre Parkway, Mountain View, CA 94040. Also affiliated with the University of California, Berkeley.
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks; it provides mathematical derivations and prose descriptions of the algorithms.
Open Source Code | Yes | Code is available (Long and Bartlett, 2024): P. M. Long and P. L. Bartlett, "SAM and the edge of stability," https://github.com/google-deepmind/sam_edge, 2024.
Open Datasets | Yes | Our first experiments are with fully connected networks on MNIST. Next, we experiment with a convolutional neural network trained on 1000 examples from CIFAR10. Finally, we experiment with a standard Transformer architecture training a language model on tiny_shakespeare, using the more practical version of SAM that uses stochastic gradients.
Dataset Splits | Yes | The last 10000 lines of tiny_shakespeare were set aside as a test set, and the remaining data was used for training.
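The held-out split described above (last 10,000 lines of tiny_shakespeare as a test set) can be sketched as follows. The function name and the pass-text-as-a-string interface are illustrative assumptions; the paper does not specify how the corpus is loaded.

```python
def split_tiny_shakespeare(text, n_test_lines=10_000):
    """Hold out the last n_test_lines lines of `text` as a test set.

    Mirrors the split described in the paper: the final 10,000 lines of
    tiny_shakespeare become the test set and everything before them is
    used for training. Newlines are preserved so the character-level
    language-modeling data is unchanged.
    """
    lines = text.splitlines(keepends=True)
    train_lines = lines[:-n_test_lines]
    test_lines = lines[-n_test_lines:]
    return "".join(train_lines), "".join(test_lines)
```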
Hardware Specification | Yes | We trained for eight hours of wallclock time on a V100 GPU. Training was performed for 12 hours on a V100 GPU.
Software Dependencies | Yes | We coded our experiments using JAX (Bradbury et al., 2018), along with Flax (Heek et al., 2023) for the image classification experiments, and Haiku (Hennigan et al., 2020) for the language model experiments. JAX: composable transformations of Python+NumPy programs, http://github.com/google/jax, version 0.3.13. Flax: a neural network library and ecosystem for JAX, http://github.com/google/flax, version 0.7.2. Haiku: Sonnet for JAX, http://github.com/deepmind/dm-haiku, version 0.0.10.
Experiment Setup | Yes | We trained once for each combination of the following hyperparameters. MNIST: learning rates η ∈ {0.03, 0.1, 0.3}, SAM offsets ρ (see (1)) ∈ {0.0, 0.1, 0.3, 1.0}. CIFAR10: learning rates ∈ {0.0003, 0.001, 0.003, 0.01}, ρ ∈ {0.0, 0.1, 0.3, 1.0}. tiny_shakespeare: learning rates ∈ {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}, ρ ∈ {0.0, 0.1, 0.3, 1.0}; an autoregressive character language model was trained on tiny_shakespeare using minibatches of size 128.
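The ρ values above are the SAM offsets referred to in the paper's equation (1). For reference, one step of the full-batch SAM update those offsets control can be sketched as below; `grad_fn`, the small-denominator guard, and the quadratic test case are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sam_step(grad_fn, w, eta, rho):
    """One full-batch SAM update with learning rate eta and offset rho.

    Perturb the weights by rho times the normalized gradient (the ascent
    offset), then take a gradient-descent step using the gradient evaluated
    at the perturbed point. With rho = 0 this reduces to plain gradient
    descent, matching the rho = 0.0 baseline in the grids above.
    """
    g = grad_fn(w)                                # gradient at current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # offset of norm ~rho
    return w - eta * grad_fn(w + eps)             # descend with SAM gradient
```

For example, with the quadratic loss 0.5·‖w‖², whose gradient is w itself, a step from w = (1, 0) with η = 0.1 and ρ = 0.1 moves along the perturbed gradient (1.1, 0) rather than (1, 0).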