DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
Authors: Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on both encoder-decoder and decoder-only models across a range of downstream tasks. The results demonstrate the effectiveness of both DAREx-q and DAREx-L2 algorithms. As summarized in Table 1, applying these techniques to a fine-tuned BERT model on the CoLA and SST-2 datasets leads to substantial performance improvements, consistently exceeding 35%. Additional results are presented in Tables 2, 3, 5, and 7 and Figures 2 and 3. |
| Researcher Affiliation | Academia | Wenlong Deng1,2, Yize Zhao1, Vala Vakilian1, Minghui Chen1,2, Xiaoxiao Li1,2, Christos Thrampoulidis1 1The University of British Columbia 2Vector Institute |
| Pseudocode | Yes | The detailed algorithm, which we call AdamR, is presented in Algorithm 1. Algorithm 1: AdamR. Algorithm 2: Empirically find q. Algorithm 3: Analytically calculate q. |
| Open Source Code | Yes | https://github.com/vengdeng/DAREx.git |
| Open Datasets | Yes | For Encoder-based LMs, we utilize four datasets: the sentence acceptability dataset CoLA (Warstadt et al., 2019), the sentiment detection dataset SST-2 (Socher et al., 2013), the paraphrase dataset MRPC (Dolan & Brockett, 2005), and the sentence similarity dataset STS-B (Cer et al., 2017). For Decoder-based LMs, we focus on mathematical reasoning tasks. ...Additionally, we utilize publicly available mathematical reasoning models, including MetaMath-llema-7B (Yu et al., 2023b), MetaMath-7B (Yu et al., 2023b), WizardMath-7B (Luo et al., 2023), and Abel-7B (Chern et al., 2023), all based on the Llama2-7B architecture (Touvron et al., 2023). We then use GSM8K (Cobbe et al., 2021) to test these models. |
| Dataset Splits | Yes | We use a validation dataset {(x_v, y_v)} to determine the best rescaling factor 1/q_v that maximizes test performance (equivalently, minimizes test error) on the validation set. Specifically, we select q_v = arg min_q P_V(f_q(x_v) ≠ y_v), where f_q represents the pruned model rescaled by 1/q. ... (b) minimizing mean output change over unlabeled data. ... Table 2: ... We report average performance and standard deviation on the test set over four independent runs. ... Evaluation metrics: the Matthews correlation coefficient for CoLA, accuracy for SST-2, a combined score of accuracy and F1 for MRPC, the mean of Pearson and Spearman correlations for STS-B, and zero-shot accuracy for GSM8K. |
| Hardware Specification | Yes | Experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions models like BERT-base-uncased, RoBERTa-base, Qwen2-0.5B, and Llama2-7B, and optimizers like AdamW and AdamR. However, specific version numbers for software libraries such as Python, PyTorch, or TensorFlow are not provided. |
| Experiment Setup | Yes | For decoder LLMs, following Yu et al. (2023a), we set the temperature to 0.0 for greedy decoding and limit the maximum number of generated tokens to 1,024 on GSM8K. For encoder-based LMs, we fine-tune BERT-base-uncased and RoBERTa-base for 10 epochs using a warmup strategy and a learning rate of 1e-4. ... The pruning rate is set to p = 0.99. |
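For orientation, the core operations the table refers to can be sketched in a few lines: DARE randomly drops a fraction p of the delta parameters (fine-tuned minus base weights) and rescales the survivors, while DAREx-q replaces the standard 1/(1-p) rescaling with a tuned factor 1/q, selected here by validation search as in the Dataset Splits row. This is a simplified NumPy sketch, not the authors' implementation; the function names, the grid of candidate q values, and the toy `eval_acc` callback are all illustrative assumptions.

```python
import numpy as np


def dare_prune(delta, p=0.99, q=None, rng=None):
    """Drop each delta parameter independently with probability p,
    then rescale the survivors by 1/q.

    Standard DARE uses q = 1 - p; DAREx-q tunes q separately.
    (Simplified sketch; not the paper's code.)
    """
    rng = np.random.default_rng(rng)
    if q is None:
        q = 1.0 - p  # vanilla DARE rescaling
    mask = rng.random(delta.shape) >= p  # keep each entry w.p. 1 - p
    return delta * mask / q


def select_q_on_validation(base, delta, eval_acc, candidates, p=0.99, seed=0):
    """Pick q from `candidates` maximizing validation accuracy of the
    pruned-and-rescaled model, mirroring q_v = arg min_q P_V(f_q(x) != y).

    `eval_acc` is a hypothetical callback mapping merged weights to a
    validation score; the same random mask (fixed seed) is reused so
    candidates differ only in the rescaling factor.
    """
    best_q, best_acc = None, -np.inf
    for q in candidates:
        pruned = dare_prune(delta, p=p, q=q, rng=seed)
        acc = eval_acc(base + pruned)
        if acc > best_acc:
            best_q, best_acc = q, acc
    return best_q, best_acc
```

With q = 1 - p the pruned delta is unbiased in expectation (each surviving entry is scaled up by exactly the inverse keep probability), which is why the paper's question of when a different 1/q rescaling helps is nontrivial.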