Revisiting Random Weight Perturbation for Efficiently Improving Generalization

Authors: Tao Li, Qinghua Tao, Weihao Yan, Yingwen Wu, Zehao Lei, Kun Fang, Mingzhen He, Xiaolin Huang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experimental evaluations, we demonstrate that our enhanced RWP methods achieve greater efficiency in enhancing generalization, particularly in large-scale problems, while also offering comparable or even superior performance to SAM. The code is released at https://github.com/nblt/mARWP." "In this section, we present extensive experimental results to demonstrate the efficiency and effectiveness of our proposed methods. We begin by introducing the experimental setup and then evaluate the performance over three standard benchmark datasets: CIFAR-10, CIFAR-100, and ImageNet. We also conduct ablation studies on the hyper-parameters and visualize the loss landscape to provide further insights."
Researcher Affiliation | Academia | Tao Li (Shanghai Jiao Tong University); Qinghua Tao (KU Leuven); Weihao Yan (Shanghai Jiao Tong University); Yingwen Wu (Shanghai Jiao Tong University); Zehao Lei (Shanghai Jiao Tong University); Kun Fang (Shanghai Jiao Tong University); Mingzhen He (Shanghai Jiao Tong University); Xiaolin Huang (Shanghai Jiao Tong University)
Pseudocode | No | The paper describes its methods and algorithms mathematically and in prose, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 5, "Improving Random Weight Perturbation," details the proposed methods with mathematical formulations and textual explanations.
Open Source Code | Yes | "The code is released at https://github.com/nblt/mARWP."
Open Datasets | Yes | "We experiment over three benchmark image classification tasks: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009)."
Dataset Splits | Yes | "We experiment over three benchmark image classification tasks: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009). For CIFAR, we apply standard random horizontal flipping, cropping, normalization, and Cutout augmentation (DeVries & Taylor, 2017) (except for ViT, for which we use RandAugment (Cubuk et al., 2020)). For ImageNet, we apply basic data preprocessing and augmentation following the public PyTorch example (Paszke et al., 2017). Mean and standard deviation are calculated over three independent trials."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications. It only describes training parameters such as batch size, epochs, and learning rates; for example, under "Training Settings": "For CIFAR experiments, we set the training epochs to 200 with batch size 256, momentum 0.9, and weight decay 0.001..."
Software Dependencies | No | The paper mentions following a "public PyTorch example" and using optimizers such as "Adam (Kingma & Ba, 2015)", but it does not specify version numbers for any software libraries, frameworks, or environments used in the experiments.
Experiment Setup | Yes | "For CIFAR experiments, we set the training epochs to 200 with batch size 256, momentum 0.9, and weight decay 0.001 (Du et al., 2022a; Zhao et al., 2022a), keeping the same among all methods for a fair comparison (except for ViT, we adopt a longer training schedule and provide the details in Appendix E). For SAM, we conduct a grid search for ρ over {0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}... For RWP and ARWP, we search σ over {0.005, 0.01, 0.015, 0.02} and set σ = 0.01... For m-RWP and m-ARWP, we use σ = 0.015 and λ = 0.5... We set η = 0.1 and β = 0.99 as default choice. For ImageNet experiments, we set the training epochs to 90 with batch size 256, weight decay 0.0001, and momentum 0.9. We use ρ = 0.05 for SAM... σ = 0.003 for RWP and ARWP, and σ = 0.005, λ = 0.5 for m-ARWP. We employ m-sharpness with m = 128 for SAM... For all experiments, we adopt cosine learning rate decay (Loshchilov & Hutter, 2016) with an initial learning rate of 0.1 and record the final model performance on the test set."
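The setup above uses cosine learning-rate decay (Loshchilov & Hutter, 2016) from an initial rate of 0.1. A minimal sketch of that schedule in plain Python; the function name and per-epoch granularity are illustrative, not taken from the paper's released code:

```python
import math

def cosine_lr(epoch, total_epochs, lr_init=0.1):
    """Cosine learning-rate decay: anneals from lr_init at epoch 0
    toward 0 at total_epochs, via 0.5 * lr_init * (1 + cos(pi * t / T))."""
    return 0.5 * lr_init * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

With the paper's 200-epoch CIFAR schedule, this gives 0.1 at epoch 0, 0.05 at epoch 100, and a rate approaching 0 at epoch 200.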
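The rows above fix a Gaussian perturbation scale σ (e.g. 0.015) and a mixing weight λ = 0.5 for m-RWP/m-ARWP. As a hedged sketch only, assuming the mixed gradient takes the form λ·∇L(w + ε) + (1 − λ)·∇L(w) with ε ~ N(0, σ²); consult the released code at https://github.com/nblt/mARWP for the authors' actual implementation:

```python
import random

def rwp_step(w, grad_fn, lr=0.1, sigma=0.015, lam=0.5):
    """One mixed-gradient RWP update on a flat list of parameters.

    Assumed form (not verified against the authors' code): mix the
    gradient at randomly perturbed weights with the gradient at the
    original weights, weighted by lam and (1 - lam) respectively.
    grad_fn(w) must return the loss gradient at w as a list.
    """
    eps = [random.gauss(0.0, sigma) for _ in w]            # random weight perturbation
    g_pert = grad_fn([wi + ei for wi, ei in zip(w, eps)])  # gradient at perturbed weights
    g_orig = grad_fn(w)                                    # gradient at original weights
    return [wi - lr * (lam * gp + (1.0 - lam) * go)        # mixed-gradient SGD step
            for wi, gp, go in zip(w, g_pert, g_orig)]
```

With sigma = 0 the two gradients coincide and the update reduces to plain SGD, which is a quick sanity check on the mixing.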
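The CIFAR pipeline applies Cutout augmentation (DeVries & Taylor, 2017), which masks out a square patch of each input image. A small illustrative NumPy sketch; the deterministic `center` argument is for clarity, whereas the real augmentation samples the patch location randomly:

```python
import numpy as np

def cutout(image, center, size):
    """Zero out a square patch of side `size` centered at `center`
    in an H x W x C image, clipping the patch at the image borders.
    Returns a masked copy; the input array is left untouched."""
    h, w = image.shape[:2]
    cy, cx = center
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0.0  # masked region is set to zero
    return out
```

On a 32x32x3 CIFAR-sized input, an 8x8 patch zeroes 64 of the 1024 pixel positions per channel.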