Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models

Authors: Junjiao Tian, Chengyue Huang, Zsolt Kira

NeurIPS 2024

Reproducibility assessment. For each variable below, the result is followed by the supporting LLM response.

Research Type: Experimental
"Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks."

Researcher Affiliation: Academia
"Junjiao Tian, Georgia Institute of Technology; Chengyue Huang, Georgia Institute of Technology; Zsolt Kira, Georgia Institute of Technology"

Pseudocode: Yes
"Alg. 1 shows the Adam optimizer with the L2-SP regularization in Eq. 1. The effects of the regularization are highlighted in blue, also shown in Eq. 2. Alg. 2 shows the proposed SPD."

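As a concrete illustration, here is a minimal Python sketch of the two ingredients those algorithms describe: an L2-SP-style penalty that pulls weights toward their pretrained values, and a selective decay step applied only when a layer drifts away from initialization. This is a hedged sketch, not the paper's exact Alg. 1/Alg. 2; the trigger condition and decay factor below are illustrative assumptions.

```python
import torch

def l2_sp_penalty(model, pretrained_state, lam):
    """L2-SP-style penalty: (lam / 2) * ||theta - theta_0||^2, measured
    against the pretrained weights theta_0 instead of the origin."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if p.requires_grad and name in pretrained_state:
            p0 = pretrained_state[name].to(p.device)
            penalty = penalty + (p - p0).pow(2).sum()
    return 0.5 * lam * penalty

@torch.no_grad()
def selective_decay(model, pretrained_state, prev_dist, lam=1.0):
    """Decay toward the pretrained weights, applied only to layers whose
    distance to initialization grew this step (an illustrative stand-in
    for the paper's actual criterion). Call after optimizer.step();
    `prev_dist` is a dict carried across steps."""
    for name, p in model.named_parameters():
        if not p.requires_grad or name not in pretrained_state:
            continue
        p0 = pretrained_state[name].to(p.device)
        dist = (p - p0).norm().item()
        if dist > prev_dist.get(name, float("inf")):
            # Shrink the deviation multiplicatively (projection-style decay).
            p.copy_(p0 + (p - p0) / (1.0 + lam))
            dist = (p - p0).norm().item()
        prev_dist[name] = dist
```

In a training loop, one would add `l2_sp_penalty(model, pretrained_state, lam)` to the task loss before `loss.backward()`, or run `selective_decay(...)` immediately after each `optimizer.step()`.
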
Open Source Code: Yes
"Code available at https://github.com/GT-RIPL/Selective-Projection-Decay.git."

Open Datasets: Yes
"Image Classification. We first analyze the behavior of SPD on conventional image classification datasets DomainNet [22] and ImageNet [23]. Semantic Segmentation. We further test SPD on the PASCAL-Context semantic segmentation dataset [29]. Common Sense Reasoning. Moreover, we show that SPD can benefit PEFT fine-tuning on large language models (LLMs). We use the Commonsense-170K dataset [34]. Visual Question Answering. Finally, we demonstrate SPD's superiority on a multi-modal task. We use Google's recently released PaliGemma [36], pretrained on a broad mixture of large-scale vision-language tasks. We fine-tune on VQAv2 [37] and test on nine OOD datasets using LoRA [7]."

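For the visual question answering setting, a LoRA-style PEFT setup along these lines is plausible; the checkpoint name, rank, alpha, and target modules below are illustrative assumptions, not the paper's configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

# Hypothetical checkpoint and LoRA hyper-parameters; not from the paper.
base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224")
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, bias="none")
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable
```
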
Dataset Splits: Yes
"The regularization hyper-parameter is found through cross-validation, and the model with the best ID validation accuracy is taken."

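A minimal sketch of that selection procedure, assuming caller-supplied training and evaluation routines (the search grid is hypothetical; the paper does not list its candidate values):

```python
def select_lambda(train_fn, eval_fn, grid=(1e-3, 1e-2, 1e-1, 1.0)):
    """Return the regularization strength whose trained model scores the
    highest in-distribution validation accuracy. `train_fn(lam)` trains a
    model; `eval_fn(model)` returns ID validation accuracy. Both are
    placeholders for the paper's (unspecified) training/eval pipeline."""
    scores = {lam: eval_fn(train_fn(lam)) for lam in grid}
    return max(scores, key=scores.get)
```
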
Hardware Specification: Yes
"We use 1 A40 GPU for each experiment." ... "We use 2 A40 GPUs for each experiment." ... "We use 4 2080Ti GPUs for each experiment." ... "We use 1 A40 GPU for each experiment." ... "We use 8 A40 GPUs for each experiment."

Software Dependencies: No
The paper mentions using specific external repositories for training code (e.g., DeiT [46], prior work [30], prior work [34], LAVIS [56]), standard augmentations (Mixup, CutMix), and the AdamW optimizer, but does not specify version numbers for these software components or for underlying libraries such as PyTorch/TensorFlow.

Experiment Setup: Yes
"Standard augmentations are used for all: weight decay (0.1), drop-path (0.2) [52], label smoothing (0.1) [53], Mixup (0.8) [54] and CutMix (1.0) [55]. The learning rate is 2e-5, and models are trained for 60 epochs for Tab. 1 and 30 epochs for Tab. 2. We use λ = 1 for all Adam-SPD results in Tab. 1."
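
Those reported hyper-parameters map naturally onto a timm/PyTorch recipe. A hedged sketch, assuming a ViT backbone and a 1000-class head (both illustrative; as noted above, the paper does not pin library versions):

```python
import timm
import torch
from timm.data import Mixup

# Reported recipe: weight decay 0.1, drop-path 0.2, label smoothing 0.1,
# Mixup 0.8, CutMix 1.0, learning rate 2e-5. The model name and
# num_classes are illustrative assumptions.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          drop_path_rate=0.2, num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)
# In the loop: images, targets = mixup_fn(images, targets) before the
# forward pass; Mixup returns soft targets, so pair it with
# timm.loss.SoftTargetCrossEntropy().
```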