EPT: Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion

Authors: Pengxiang Lan, Enneng Yang, Yuting Liu, Guibing Guo, Jianzhe Zhao, Xingwei Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on 13 natural language processing downstream tasks show that our method significantly and consistently outperforms 11 comparison methods with the relative percentage of improvements up to 12.9%, and training time decreased by 14%."
Researcher Affiliation | Academia | "1 Software College, Northeastern University, China; 2 School of Computer Science and Engineering, Northeastern University, China; EMAIL, EMAIL, EMAIL"
Pseudocode | No | The paper describes the proposed method in sections such as 'Prompt Decomposition', 'Prompt Fusion', 'Multi-Space Projection', and 'Reconstructed Prompt' using natural language and mathematical equations. There are no explicit sections or figures labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | "We conducted multi-angle experiments on the EPT method to demonstrate its outstanding applicability to 13 publicly available NLP tasks (8 from the GLUE benchmark and 5 from the SuperGLUE benchmark). Specifically, (1) GLUE (Wang et al. 2018) is a benchmark for evaluating natural language understanding performance. It consists of diverse tasks that test the model's ability to understand language in different contexts. To fully prove the performance effect of EPT, we maintain consistency with previous work, and the NLP datasets are MNLI (Williams, Nangia, and Bowman 2018), QQP (Wang et al. 2018), QNLI (Rajpurkar et al. 2016), SST-2 (Socher et al. 2013), STS-B (Cer et al. 2017), MRPC (Dolan and Brockett 2005), RTE (Giampiccolo et al. 2007) and CoLA (Warstadt, Singh, and Bowman 2019) from GLUE. (2) SuperGLUE (Wang et al. 2019) is an extension of GLUE that includes more complex and challenging tasks. This paper uses five tasks from SuperGLUE: MultiRC (Khashabi et al. 2018), BoolQ (Clark et al. 2019), WiC (Pilehvar and Camacho-Collados 2019), WSC (Levesque, Davis, and Morgenstern 2012) and CB (De Marneffe, Simons, and Tonhauser 2019). We follow the setup of previous work (Su et al. 2022; Asai et al. 2022; Shi and Lipani 2024), which only utilizes ReCoRD (Zhang et al. 2018) and SQuAD (Rajpurkar et al. 2016) in the few-shot experiment."
Dataset Splits | Yes | "We evaluate the performance of EPT, vanilla PT, and MPT in k-shot (k = 4, 16, 32) settings on the GLUE benchmark. As shown in Figure 4(a), the performance improvement of EPT is mainly due to using the PETL framework for pre-training source prompts. We conducted multi-angle experiments on the EPT method to demonstrate its outstanding applicability to 13 publicly available NLP tasks (8 from the GLUE benchmark and 5 from the SuperGLUE benchmark)."
Hardware Specification | No | "To reduce GPU memory usage, we employed quantization techniques (Dettmers et al. 2021, 2023) for models with a size of 3B or larger. This process involves rescaling the input tensors by loading the model in 4-bit precision and de-quantizing the values to bf16 during training. We minimize storage consumption by implementing the double quantization method proposed in QLoRA (Dettmers et al. 2023), an approach that significantly reduces memory usage while maintaining performance comparable to standard parameter-efficient fine-tuning. Notably, weight gradients are still calculated exclusively on the soft prompt parameters."
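The quoted passage describes QLoRA-style loading: 4-bit weights, double quantization, and bf16 compute, with only the soft-prompt parameters receiving gradients. A minimal sketch of such a configuration using the Hugging Face `transformers` / `bitsandbytes` APIs is below; the model checkpoint name and the freezing loop are illustrative assumptions, since the paper does not state its software stack.

```python
# Sketch of QLoRA-style 4-bit loading with double quantization, matching the
# quoted description. Checkpoint name ("t5-3b") is an assumption for illustration.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # QLoRA's double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # de-quantize to bf16 during training
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "t5-3b", quantization_config=bnb_config
)

# Freeze the quantized backbone; only soft-prompt parameters (added by the
# prompt-tuning method, not shown here) would keep requires_grad=True.
for param in model.parameters():
    param.requires_grad = False
```

This configuration fragment is not runnable without downloading the checkpoint, so it is intended only to make the quoted memory-saving setup concrete.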
Software Dependencies | No | The paper describes using quantization techniques proposed in QLoRA (Dettmers et al. 2023) and refers to various models (T5-Base, T5-3B, T5-11B, Llama2-7B), but does not provide specific version numbers for any software dependencies such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | "The main experiments of EPT and the baselines are performed using the T5-Base model (Shi and Lipani 2024), which has a parameter size of 220M and a hidden size d of 768. Consistent with the experimental setup of DEPT, we decompose the vanilla prompt (parameter size 76,800) with a prompt length of 100 tokens. We train for 30,000 steps on small datasets with fewer than 100k training examples and 300,000 steps on large datasets with more than 100k examples. The batch size is 16 and the number of spaces is 4. For soft prompts, we search for the learning rate within the set {3e-1, 4e-1, 5e-1}; for the low-rank matrices, we search for the learning rate within the set {1e-4, 5e-4, 5e-3}."
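As a quick sanity check on the numbers in the quoted setup, the vanilla prompt's parameter count and the size of the learning-rate search follow directly from the stated values. The sketch below reproduces them; pairing the two search sets into a full Cartesian grid is an illustrative assumption, as the paper only lists the two sets separately.

```python
from itertools import product

d = 768           # T5-Base hidden size (from the quoted setup)
prompt_len = 100  # vanilla prompt length in tokens
vanilla_params = prompt_len * d
print(vanilla_params)  # 76800, matching the stated parameter size

# Learning-rate search sets from the quoted setup; combining them into a
# grid is an assumption for illustration.
prompt_lrs = [3e-1, 4e-1, 5e-1]
lowrank_lrs = [1e-4, 5e-4, 5e-3]
grid = list(product(prompt_lrs, lowrank_lrs))
print(len(grid))  # 9 candidate (soft-prompt lr, low-rank lr) configurations
```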