νSAM: Memory-Efficient Sharpness-Aware Minimization via Nuclear Norm Constraints

Authors: Thomas Pethick, Parameswaran Raman, Lenon Minorics, Mingyi Hong, Shoham Sabach, Volkan Cevher

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We extensively evaluate our method νSAM on vision transformers (ViTs) and MLP-Mixer models when both fine-tuning and training from scratch, and additionally fine-tune BERT on a set of language tasks. We find that νSAM consistently outperforms the baseline AdamW in all cases and achieves comparable performance with SAM. Surprisingly, this is the case even when the low-rank decomposition of the perturbation ε is only coarsely approximated through a single power iteration, which avoids adding any wall-clock time as compared with SAM. Interestingly, we find that νSAM enjoys a substantial improvement over SAM for MLP-Mixer models both when fine-tuning and when training from scratch.
Researcher Affiliation Collaboration Thomas Pethick (EMAIL), LIONS, IEM, STI, École Polytechnique Fédérale de Lausanne; Parameswaran Raman (EMAIL), Amazon Web Services; Lenon Minorics (EMAIL), Amazon Web Services; Mingyi Hong (EMAIL), University of Minnesota; Shoham Sabach (EMAIL), Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology; Volkan Cevher (EMAIL), LIONS, IEM, STI, École Polytechnique Fédérale de Lausanne
Pseudocode Yes Algorithm 1: Nuclear norm based sharpness-aware minimization (νSAM); Algorithm 2: Top singular value decomposition (SVDtop1)
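The SVDtop1 routine above relies on power iteration to approximate the top singular pair of a gradient-shaped matrix; the paper notes that even a single iteration suffices in practice. A minimal sketch under stated assumptions (the function name and interface are illustrative, not taken from the paper's pseudocode):

```python
import numpy as np

def svd_top1_power_iteration(G, num_iters=1, eps=1e-12, seed=0):
    """Approximate the top singular triplet (u, s, v) of matrix G.

    With num_iters=1 this mirrors the coarse single-power-iteration
    approximation the paper reports as adding no wall-clock overhead
    relative to SAM; more iterations tighten the estimate.
    """
    m, n = G.shape
    rng = np.random.default_rng(seed)
    # Start from a random unit right-singular-vector estimate.
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v) + eps
    for _ in range(num_iters):
        u = G @ v
        u /= np.linalg.norm(u) + eps
        v = G.T @ u
        v /= np.linalg.norm(v) + eps
    # Rayleigh-quotient estimate of the top singular value.
    s = u @ (G @ v)
    return u, s, v
```

The rank-1 outer product `s * np.outer(u, v)` then serves as the low-rank approximation of the perturbation direction.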
Open Source Code No The paper does not explicitly state that source code for their methodology is being released or provide a link to a repository. It mentions the FairScale implementation (FairScale authors, 2021), which is a third-party tool.
Open Datasets Yes We train multiple sizes of ViTs and MLP-Mixer on CIFAR10/100 from scratch. We fine-tune a pretrained BERT-base (uncased) (Devlin et al., 2018) on the GLUE benchmark (Wang et al., 2018).
Dataset Splits No The paper mentions using CIFAR10/100 and GLUE benchmark, which are standard datasets with predefined splits, but it does not explicitly detail the train/test/validation splits used for their experiments. While standard splits are implied, specific percentages or sample counts for their usage are not provided within the main text.
Hardware Specification Yes All experiments are run on a single NVIDIA V100 GPU.
Software Dependencies No The paper mentions popular optimizers such as Adam (Kingma, 2014) and AdamW (Loshchilov & Hutter, 2017) and frameworks such as PyTorch, but does not provide specific version numbers for any of these software dependencies.
Experiment Setup Yes Baseline & hyperparameters: Since there is a lack of good hyperparameter defaults for these architectures on small datasets, we first find a good configuration for the base optimizer AdamW on CIFAR10. Table 7: Baseline (AdamW) hyperparameters for training from scratch. We use a cosine learning rate schedule with linear warmup. We use standard augmentations (random cropping and flipping) and AutoAugment.
  Learning rate: 0.0005
  Label smoothing: 0.1
  Weight decay: 0.05
  Warmup epochs: 10%
  Epochs: 300
  Dropout rate: 0.0
  Drop path rate: 0.1
  Gradient clipping: Disabled
  Batch size: 128
Table 11: Baseline (AdamW) hyperparameters for fine-tuning. We use a cosine learning rate schedule. Table 12: Hyperparameters for fine-tuning on GLUE.
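The cosine learning-rate schedule with linear warmup described in Table 7 can be sketched as a plain learning-rate function. This is a sketch under stated assumptions: the constants mirror Table 7 (base LR 0.0005, 300 epochs, 10% warmup), but the paper does not give the exact schedule formula, so the warmup and cosine forms here are conventional choices, not the authors' implementation.

```python
import math

def lr_at_epoch(epoch, base_lr=0.0005, total_epochs=300, warmup_frac=0.10):
    """Cosine learning-rate schedule with linear warmup.

    Warmup covers the first warmup_frac of training (10% of 300
    epochs per Table 7), ramping linearly from ~0 to base_lr; the
    remainder decays following a half-cosine toward 0.
    """
    warmup_epochs = warmup_frac * total_epochs
    if epoch < warmup_epochs:
        # Linear ramp up to base_lr over the warmup period.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule peaks at the end of warmup (epoch 30) at the base rate 0.0005 and decays to near zero by epoch 299.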