Revisiting Random Weight Perturbation for Efficiently Improving Generalization
Authors: Tao Li, Qinghua Tao, Weihao Yan, Yingwen Wu, Zehao Lei, Kun Fang, Mingzhen He, Xiaolin Huang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experimental evaluations, we demonstrate that our enhanced RWP methods achieve greater efficiency in enhancing generalization, particularly in large-scale problems, while also offering comparable or even superior performance to SAM. The code is released at https://github.com/nblt/mARWP. In this section, we present extensive experimental results to demonstrate the efficiency and effectiveness of our proposed methods. We begin by introducing the experimental setup and then evaluate the performance over three standard benchmark datasets: CIFAR-10, CIFAR-100, and ImageNet. We also conduct ablation studies on the hyper-parameters and visualize the loss landscape to provide further insights. |
| Researcher Affiliation | Academia | Tao Li (Shanghai Jiao Tong University); Qinghua Tao (KU Leuven); Weihao Yan, Yingwen Wu, Zehao Lei, Kun Fang, Mingzhen He, and Xiaolin Huang (Shanghai Jiao Tong University) |
| Pseudocode | No | The paper describes methods and algorithms mathematically and in prose, but it does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 5 "Improving Random Weight Perturbation" details the proposed methods with mathematical formulations and textual explanations. |
| Open Source Code | Yes | The code is released at https://github.com/nblt/mARWP. |
| Open Datasets | Yes | We experiment over three benchmark image classification tasks: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | We experiment over three benchmark image classification tasks: CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and ImageNet (Deng et al., 2009). For CIFAR, we apply standard random horizontal flipping, cropping, normalization, and Cutout augmentation (DeVries & Taylor, 2017) (except for ViT, for which we use RandAugment (Cubuk et al., 2020)). For ImageNet, we apply basic data preprocessing and augmentation following the public PyTorch example (Paszke et al., 2017). Mean and standard deviation are calculated over three independent trials. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications. It only describes training parameters like batch size, epochs, and learning rates. For example, in "Training Settings": "For CIFAR experiments, we set the training epochs to 200 with batch size 256, momentum 0.9, and weight decay 0.001..." |
| Software Dependencies | No | The paper mentions using a "public Pytorch example" and optimizers like "Adam (Kingma & Ba, 2015)", but it does not specify version numbers for any software libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | For CIFAR experiments, we set the training epochs to 200 with batch size 256, momentum 0.9, and weight decay 0.001 (Du et al., 2022a; Zhao et al., 2022a), keeping the same among all methods for a fair comparison (except for ViT, for which we adopt a longer training schedule and provide the details in Appendix E). For SAM, we conduct a grid search for ρ over {0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}... For RWP and ARWP, we search σ over {0.005, 0.01, 0.015, 0.02} and set σ = 0.01... For m-RWP and m-ARWP, we use σ = 0.015 and λ = 0.5... We set η = 0.1 and β = 0.99 as the default choice. For ImageNet experiments, we set the training epochs to 90 with batch size 256, weight decay 0.0001, and momentum 0.9. We use ρ = 0.05 for SAM... σ = 0.003 for RWP and ARWP, and σ = 0.005, λ = 0.5 for m-ARWP. We employ m-sharpness with m = 128 for SAM... For all experiments, we adopt cosine learning rate decay (Loshchilov & Hutter, 2016) with an initial learning rate of 0.1 and record the final model performance on the test set. |
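Since the paper itself contains no pseudocode (see the Pseudocode row above), the mixed-gradient update implied by the quoted setup (σ as the perturbation scale, λ as the mixing coefficient for m-RWP/m-ARWP) can only be sketched from those hyper-parameters. Below is a minimal PyTorch sketch, not the authors' implementation: it uses a single shared batch for both gradients, a global (rather than filter-wise) perturbation scale, and a hypothetical toy model and data.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model and synthetic batch, purely for illustration.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))

sigma, lam = 0.015, 0.5  # values quoted for m-RWP on CIFAR

def mixed_rwp_step():
    # 1) Gradient of the original (unperturbed) loss.
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    g_orig = [p.grad.clone() for p in model.parameters()]

    # 2) Gradient at randomly perturbed weights w + eps, with eps drawn as
    #    Gaussian noise scaled by sigma and the weight norm (a simplification
    #    of the paper's perturbation generation).
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            e = torch.randn_like(p) * sigma * p.norm()
            p.add_(e)
            eps.append(e)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)  # restore the original weights

    # 3) Mix the two gradients: g = lam * g_perturbed + (1 - lam) * g_original.
    with torch.no_grad():
        for p, g0 in zip(model.parameters(), g_orig):
            p.grad.mul_(lam).add_(g0, alpha=1 - lam)
    opt.step()

loss_before = loss_fn(model(x), y).item()
mixed_rwp_step()
loss_after = loss_fn(model(x), y).item()
print(loss_before, loss_after)
```

Unlike SAM's sequential ascent-then-descent step, the two gradients here are independent of each other, which is the source of the parallelism the paper exploits for efficiency.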
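The cosine learning rate decay cited in the setup (Loshchilov & Hutter, 2016; initial learning rate 0.1, no restarts) has a simple closed form; a self-contained sketch of that schedule over the 200-epoch CIFAR budget, with hypothetical function names:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1):
    """Cosine decay from base_lr down to 0, without warm restarts."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

print(round(cosine_lr(0, 200), 4))    # epoch 0: the initial rate, 0.1
print(round(cosine_lr(100, 200), 4))  # halfway: 0.05
print(round(cosine_lr(200, 200), 4))  # final epoch: 0.0
```

In practice the PyTorch scheduler `torch.optim.lr_scheduler.CosineAnnealingLR` implements the same curve, so the closed form above is mainly useful for checking what rate a given epoch should see.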