ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

Authors: Pengwei Tang, Xiaolin Hu, Yong Liu

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In comprehensive experiments across 23 natural language processing tasks and 4 typical PLMs of different scales, ADePT consistently surpasses the other leading parameter-efficient fine-tuning methods, and even outperforms full fine-tuning in certain scenarios. We also provide a theoretical analysis towards ADePT. Code is available at https://github.com/HungerPWAY/ADePT.
Researcher Affiliation | Academia | Pengwei Tang, Xiaolin Hu, Yong Liu; Renmin University of China, Beijing, China
Pseudocode | No | The paper describes methods using mathematical formulations and textual explanations but does not include any explicit 'Pseudocode' or 'Algorithm' labeled blocks or figures.
Open Source Code | Yes | Code is available at https://github.com/HungerPWAY/ADePT.
Open Datasets | Yes | We consider four benchmarks and 4 other datasets: (1) the GLUE benchmark (Wang et al., 2018), which includes MNLI (Williams et al., 2018), QQP, QNLI (Rajpurkar et al., 2016), SST-2 (Socher et al., 2013), STS-B (Cer et al., 2017), MRPC (Dolan & Brockett, 2005), RTE (Giampiccolo et al., 2007) and CoLA (Warstadt et al., 2019); (2) the SuperGLUE benchmark (Wang et al., 2019), which includes MultiRC (Khashabi et al., 2018), BoolQ (Clark et al., 2019), WiC (Pilehvar & Camacho-Collados, 2019), WSC (Levesque et al., 2012), CB (De Marneffe et al., 2019) and ReCoRD (Zhang et al., 2018); (3) the MRQA 2019 Shared Task (Fisch et al., 2019), which includes Natural Questions (Kwiatkowski et al., 2019), HotpotQA (Yang et al., 2018), SearchQA (Dunn et al., 2017) and NewsQA (Trischler et al., 2017); (4) the MBPP benchmark (Austin et al., 2021), which is a code generation task; (5) other datasets, which include WinoGrande (Sakaguchi et al., 2021), Yelp-2 (Zhang et al., 2015), SciTail (Khot et al., 2018) and PAWS-Wiki (Zhang et al., 2019).
Dataset Splits | Yes | For the MBPP benchmark, following Jain et al. (2024), we use a 50-50 split for training and test. Table 16: The datasets assessed in this study are described as follows. The term Train refers to the number of samples in the training set, whereas Valid and Test indicate the number of samples in the validation set and test set, respectively. (Table 16 then lists specific Train, Valid, Test counts for all datasets.)
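The 50-50 MBPP train/test split mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the helper name, the fixed seed, and the example count of 974 MBPP problems are assumptions for demonstration.

```python
import random

def fifty_fifty_split(examples, seed=0):
    """Hypothetical helper: shuffle a dataset and split it 50-50
    into train and test halves, as described for MBPP
    (following Jain et al., 2024)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    half = len(indices) // 2
    train = [examples[i] for i in indices[:half]]
    test = [examples[i] for i in indices[half:]]
    return train, test

# Example with 974 placeholder problems (an assumed count):
# 487 land in train, 487 in test, and no example is dropped.
train, test = fifty_fifty_split(list(range(974)))
```

Shuffling before splitting avoids any ordering bias in the original dataset file; the fixed seed keeps the split identical across runs.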
Hardware Specification | No | The paper mentions 'GPU resources' and 'computational resources' in general terms, but does not provide specific details on the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | We implement our experiments by using PyTorch, Huggingface Transformers, and Huggingface PEFT. (The paper cites these libraries via footnoted URLs and does not give specific version numbers.)
Experiment Setup | Yes | Following Shi & Lipani (2024), we use 100 learnable virtual tokens as the soft prompt of PT. For our proposed ADePT, we adjust the hyperparameters to maintain an equivalent number of trainable parameters as PT... For the T5-base model, the token embedding dimension d is 768... we search the length of the soft prompt from 20, 40, 60, and 80... For the T5-base model, we directly quote performance metrics from published papers... For the T5-3B model, we consistently use 60 virtual tokens and bottleneck size r = 19... For small datasets (< 70,000 training samples) based on the T5 model, we follow the learning strategy of Shi & Lipani (2024): we search the learning rate for the soft prompt from 3e-1, 4e-1, 5e-1, and for the feed-forward neural network from 1e-4, 1e-5. For large datasets (> 70,000 training samples) based on the T5 model, we use learning rate 3e-1 for the soft prompt and 1e-4 for the feed-forward neural networks. For the MBPP benchmark, following Jain et al. (2024), we use learning rates of 1e-3 for the prompting-style tuning method and 1e-4 for LoRA. Appendix E provides detailed hyperparameters in Table 17, Table 18, and Table 19, including 'number of steps', 'batch size', 'maximum learning rate', 'length of the soft prompt', 'maximum sequence length', 'learning rate optimizer AdamW', 'Adam epsilon', 'Adam beta weights', 'learning rate scheduler Warmup linear', 'Weight decay', and 'Warmup steps'.
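The hyperparameter search described above can be enumerated as a small grid. This is a sketch of the search space as quoted in the report, not the authors' tuning code; the function names and dictionary keys are hypothetical.

```python
from itertools import product

# Search space quoted for small (< 70,000 training samples) T5-base
# datasets: soft-prompt LR, feed-forward-network LR, and prompt length
# are searched independently.
PROMPT_LRS = [3e-1, 4e-1, 5e-1]
FFN_LRS = [1e-4, 1e-5]
PROMPT_LENGTHS = [20, 40, 60, 80]

def small_dataset_grid():
    """Enumerate every (prompt_lr, ffn_lr, prompt_len) combination."""
    return [
        {"prompt_lr": p, "ffn_lr": f, "prompt_len": n}
        for p, f, n in product(PROMPT_LRS, FFN_LRS, PROMPT_LENGTHS)
    ]

def large_dataset_config():
    """Large datasets (> 70,000 samples) use fixed rates per the report."""
    return {"prompt_lr": 3e-1, "ffn_lr": 1e-4}

grid = small_dataset_grid()  # 3 prompt LRs x 2 FFN LRs x 4 lengths = 24 runs
```

Separating the two learning rates reflects ADePT's design, where the soft prompt and the token-wise feed-forward network are distinct trainable components with very different scales.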