Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models

Authors: Junjiao Tian, Chengyue Huang, Zsolt Kira

NeurIPS 2024

Reproducibility assessment. For each variable below, the result is followed by the supporting LLM response.

Research Type: Experimental
"Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks."

Researcher Affiliation: Academia
"Junjiao Tian, Georgia Institute of Technology; Chengyue Huang, Georgia Institute of Technology; Zsolt Kira, Georgia Institute of Technology"

Pseudocode: Yes
"Alg. 1 shows the Adam optimizer with the L2-SP regularization in Eq. 1. The effects of the regularization are highlighted in blue, also shown in Eq. 2. Alg. 2 shows the proposed SPD."

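As a concrete illustration, here is a minimal Python sketch of the two ingredients those algorithms describe: an L2-SP-style penalty that pulls weights toward their pretrained values, and a selective decay step applied only when a layer drifts away from initialization. This is a hedged sketch, not the paper's exact Alg. 1/Alg. 2; the trigger condition and decay factor below are illustrative assumptions.

```python
import torch

def l2_sp_penalty(model, pretrained_state, lam):
    """L2-SP-style penalty: (lam / 2) * ||theta - theta_0||^2, measured
    against the pretrained weights theta_0 instead of the origin."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if p.requires_grad and name in pretrained_state:
            p0 = pretrained_state[name].to(p.device)
            penalty = penalty + (p - p0).pow(2).sum()
    return 0.5 * lam * penalty

@torch.no_grad()
def selective_decay(model, pretrained_state, prev_dist, lam=1.0):
    """Decay toward the pretrained weights, applied only to layers whose
    distance to initialization grew this step (an illustrative stand-in
    for the paper's actual criterion). Call after optimizer.step();
    `prev_dist` is a dict carried across steps."""
    for name, p in model.named_parameters():
        if not p.requires_grad or name not in pretrained_state:
            continue
        p0 = pretrained_state[name].to(p.device)
        dist = (p - p0).norm().item()
        if dist > prev_dist.get(name, float("inf")):
            # Shrink the deviation multiplicatively (projection-style decay).
            p.copy_(p0 + (p - p0) / (1.0 + lam))
            dist = (p - p0).norm().item()
        prev_dist[name] = dist
```

In a training loop, one would add `l2_sp_penalty(model, pretrained_state, lam)` to the task loss before `loss.backward()`, or run `selective_decay(...)` immediately after each `optimizer.step()`.
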
Open Source Code: Yes
"Code available at https://github.com/GT-RIPL/Selective-Projection-Decay.git."

Open Datasets: Yes
"Image Classification. We first analyze the behavior of SPD on conventional image classification datasets DomainNet [22] and ImageNet [23]. Semantic Segmentation. We further test SPD on the PASCAL-Context semantic segmentation dataset [29]. Common Sense Reasoning. Moreover, we show that SPD can benefit PEFT fine-tuning on large language models (LLMs). We use the Commonsense-170K dataset [34]. Visual Question Answering. Finally, we demonstrate SPD's superiority on a multi-modal task. We use Google's recently released PaliGemma [36], pretrained on a broad mixture of large-scale vision-language tasks. We fine-tune on VQAv2 [37] and test on nine OOD datasets using LoRA [7]."

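For the visual question answering setting, a LoRA-style PEFT setup along these lines is plausible; the checkpoint name, rank, alpha, and target modules below are illustrative assumptions, not the paper's configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

# Hypothetical checkpoint and LoRA hyper-parameters; not from the paper.
base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224")
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      lora_dropout=0.05, bias="none")
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable
```
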
Dataset Splits: Yes
"The regularization hyper-parameter is found through cross-validation, and the model with the best ID validation accuracy is taken."

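A minimal sketch of that selection procedure, assuming caller-supplied training and evaluation routines (the search grid is hypothetical; the paper does not list its candidate values):

```python
def select_lambda(train_fn, eval_fn, grid=(1e-3, 1e-2, 1e-1, 1.0)):
    """Return the regularization strength whose trained model scores the
    highest in-distribution validation accuracy. `train_fn(lam)` trains a
    model; `eval_fn(model)` returns ID validation accuracy. Both are
    placeholders for the paper's (unspecified) training/eval pipeline."""
    scores = {lam: eval_fn(train_fn(lam)) for lam in grid}
    return max(scores, key=scores.get)
```
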
Hardware Specification: Yes
"We use 1 A40 GPU for each experiment." ... "We use 2 A40 GPUs for each experiment." ... "We use 4 2080Ti GPUs for each experiment." ... "We use 1 A40 GPU for each experiment." ... "We use 8 A40 GPUs for each experiment."

Software Dependencies: No
The paper mentions using specific external repositories for training code (e.g., DeiT [46], prior work [30], prior work [34], LAVIS [56]), standard augmentations (Mixup, CutMix), and the AdamW optimizer, but does not specify version numbers for these software components or for underlying libraries such as PyTorch/TensorFlow.

Experiment Setup: Yes
"Standard augmentations are used for all: weight decay (0.1), drop-path (0.2) [52], label smoothing (0.1) [53], Mixup (0.8) [54] and CutMix (1.0) [55]. The learning rate is 2e-5, and models are trained for 60 epochs for Tab. 1 and 30 epochs for Tab. 2. We use λ = 1 for all Adam-SPD results in Tab. 1."
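
Those reported hyper-parameters map naturally onto a timm/PyTorch recipe. A hedged sketch, assuming a ViT backbone and a 1000-class head (both illustrative; as noted above, the paper does not pin library versions):

```python
import timm
import torch
from timm.data import Mixup

# Reported recipe: weight decay 0.1, drop-path 0.2, label smoothing 0.1,
# Mixup 0.8, CutMix 1.0, learning rate 2e-5. The model name and
# num_classes are illustrative assumptions.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          drop_path_rate=0.2, num_classes=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)
# In the loop: images, targets = mixup_fn(images, targets) before the
# forward pass; Mixup returns soft targets, so pair it with
# timm.loss.SoftTargetCrossEntropy().
```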