DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
Authors: Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on both encoder-decoder and decoder-only models across a range of downstream tasks. The results demonstrate the effectiveness of both DAREx-q and DAREx-L2 algorithms. As summarized in Table 1, applying these techniques to a fine-tuned BERT model on the CoLA and SST-2 datasets leads to substantial performance improvements, consistently exceeding 35%. Additional results are presented in Tables 2, 3, 5, and 7 and Figures 2 and 3. |
| Researcher Affiliation | Academia | Wenlong Deng1,2, Yize Zhao1, Vala Vakilian1, Minghui Chen1,2, Xiaoxiao Li1,2, Christos Thrampoulidis1 1The University of British Columbia 2Vector Institute |
| Pseudocode | Yes | The detailed algorithm, which we call AdamR, is presented in Algorithm 1. Algorithm 1: AdamR. Algorithm 2: Empirically find q. Algorithm 3: Analytically calculate q. |
| Open Source Code | Yes | https://github.com/vengdeng/DAREx.git |
| Open Datasets | Yes | For Encoder-based LMs, we utilize four datasets: the sentence acceptability dataset CoLA (Warstadt et al., 2019), the sentiment detection dataset SST-2 (Socher et al., 2013), the paraphrase dataset MRPC (Dolan & Brockett, 2005), and the sentence similarity dataset STS-B (Cer et al., 2017). For Decoder-based LMs, we focus on mathematical reasoning tasks. ...Additionally, we utilize publicly available mathematical reasoning models, including MetaMath-llema-7B (Yu et al., 2023b), MetaMath-7B (Yu et al., 2023b), WizardMath-7B (Luo et al., 2023), and Abel-7B (Chern et al., 2023), all based on the Llama2-7B architecture (Touvron et al., 2023). We then use GSM8K (Cobbe et al., 2021) to test these models. |
| Dataset Splits | Yes | We use a validation dataset {(x_v, y_v)} to determine the best rescaling factor 1/q_v that maximizes test performance (equivalently, minimizes test error) on the validation set. Specifically, we select q_v = arg min_q P_V(f_q(x_v) ≠ y_v), where f_q represents the pruned model rescaled by 1/q. ... (b) minimizing mean output change over unlabeled data. ... Table 2: ... We report average performance and standard deviation on the test set over four independent runs. ... Evaluation metrics: the Matthews correlation coefficient for CoLA, accuracy for SST-2, a combined score of accuracy and F1 for MRPC, the mean of Pearson and Spearman correlations for STS-B, and zero-shot accuracy for GSM8K. |
| Hardware Specification | Yes | Experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions models like BERT-base-uncased, RoBERTa-base, Qwen2-0.5B, and Llama2-7B, and optimizers like AdamW and AdamR. However, specific version numbers for software libraries such as Python, PyTorch, or TensorFlow are not provided. |
| Experiment Setup | Yes | For decoder LLMs, following Yu et al. (2023a), we set the temperature to 0.0 for greedy decoding and limit the maximum number of generated tokens to 1,024 on GSM8K. For encoder-based LMs, we fine-tune BERT-base-uncased and RoBERTa-base for 10 epochs using a warmup strategy and a learning rate of 1e-4. ... The pruning rate is set to p = 0.99. |
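For orientation, the core operations the table refers to can be sketched in a few lines: DARE randomly drops a fraction p of the delta parameters (fine-tuned minus base weights) and rescales the survivors, while DAREx-q replaces the standard 1/(1-p) rescaling with a tuned factor 1/q, selected here by validation search as in the Dataset Splits row. This is a simplified NumPy sketch, not the authors' implementation; the function names, the grid of candidate q values, and the toy `eval_acc` callback are all illustrative assumptions.

```python
import numpy as np


def dare_prune(delta, p=0.99, q=None, rng=None):
    """Drop each delta parameter independently with probability p,
    then rescale the survivors by 1/q.

    Standard DARE uses q = 1 - p; DAREx-q tunes q separately.
    (Simplified sketch; not the paper's code.)
    """
    rng = np.random.default_rng(rng)
    if q is None:
        q = 1.0 - p  # vanilla DARE rescaling
    mask = rng.random(delta.shape) >= p  # keep each entry w.p. 1 - p
    return delta * mask / q


def select_q_on_validation(base, delta, eval_acc, candidates, p=0.99, seed=0):
    """Pick q from `candidates` maximizing validation accuracy of the
    pruned-and-rescaled model, mirroring q_v = arg min_q P_V(f_q(x) != y).

    `eval_acc` is a hypothetical callback mapping merged weights to a
    validation score; the same random mask (fixed seed) is reused so
    candidates differ only in the rescaling factor.
    """
    best_q, best_acc = None, -np.inf
    for q in candidates:
        pruned = dare_prune(delta, p=p, q=q, rng=seed)
        acc = eval_acc(base + pruned)
        if acc > best_acc:
            best_q, best_acc = q, acc
    return best_q, best_acc
```

With q = 1 - p the pruned delta is unbiased in expectation (each surviving entry is scaled up by exactly the inverse keep probability), which is why the paper's question of when a different 1/q rescaling helps is nontrivial.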