PAT: Pruning-Aware Tuning for Large Language Models

Authors: Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves a 1.33× speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy at a similar training cost. In this section, we present the experimental results and analysis. We begin by describing the experimental setup. Next, we showcase our main results across various Large Language Models (LLMs). We then delve into the efficiency-accuracy trade-off, examining memory and latency considerations. Finally, we conduct ablation studies on the trainable mask and identity loss.
Researcher Affiliation | Collaboration | 1 School of Electronic Science and Engineering, Nanjing University; 2 University of Arizona; 3 Samsung Electronic Research Centre of China; 4 Interdisciplinary Research Center for Future Intelligent Chips, Nanjing University, Suzhou. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology using mathematical formulations (Equations 1-12) and architectural diagrams (Figure 2), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/kriskrisliu/PAT
Open Datasets | Yes | We employ the LaMini-instruction dataset (Wu et al. 2023) for fine-tuning. ... We conduct zero-shot evaluation on 14 datasets, including ARC-Challenge (Clark et al. 2018), ARC-Easy (Clark et al. 2018), BoolQ (Wang et al. 2019a), COPA (Wang et al. 2019a), HellaSwag (Zellers et al. 2019), MMLU (Hendrycks et al. 2021), MultiRC (Wang et al. 2019a), OpenBookQA (Mihaylov et al. 2018), PIQA (Bisk et al. 2020), RTE (Wang et al. 2019a), SIQA (Sap et al. 2019), WiC (Wang et al. 2019a), WinoGrande (Sakaguchi et al. 2021), WSC (Wang et al. 2019a).
Dataset Splits | Yes | We employ the LaMini-instruction dataset (Wu et al. 2023) for fine-tuning. To reduce training costs, we randomly drop 50% of the samples, resulting in a final dataset of 1 million samples. Unless otherwise stated, all experimental results are based on this setting. We conduct zero-shot evaluation on 14 datasets...
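The 50% random drop described above can be sketched as a simple subsampling step. This is a minimal illustration, not the authors' code: the function name and the fixed seed are assumptions, and the demo list stands in for the actual LaMini-instruction samples.

```python
import random

def subsample_half(dataset, seed=0):
    """Randomly keep 50% of the samples (seed is a hypothetical choice;
    the paper does not report one)."""
    rng = random.Random(seed)
    return rng.sample(dataset, len(dataset) // 2)

# Illustrative stand-in for the instruction corpus:
demo = list(range(2_000))
kept = subsample_half(demo)
print(len(kept))  # 1000
```

Applied to the full LaMini-instruction corpus, this yields the roughly 1 million training samples the paper reports.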
Hardware Specification | Yes | Experiments are conducted using A100 GPUs. ... The base Llama2 13B model encounters Out-Of-Memory (OOM) errors at a batch size larger than 288 when executed on a single A100-80GB GPU.
Software Dependencies | No | The paper mentions using "model frameworks and checkpoints from Hugging Face (Jain 2022; Wolf et al. 2019)" but does not provide specific version numbers for Hugging Face libraries, Python, PyTorch, CUDA, or other critical software components.
Experiment Setup | Yes | The models are fine-tuned over 3 epochs using the Alpaca instruction template. The learning rate is set to 5×10⁻⁵ with a cosine schedule. The batch size is set to 128, and the sequence length is 256 tokens. The milestone step of our PAT, s0, is set to 1/3 of the total training steps.
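As a rough sketch of this setup (not the authors' implementation), the cosine learning-rate schedule and the PAT milestone step s0 can be expressed as follows. `total_steps` is an illustrative value, and the schedule assumes no warmup, since the paper does not mention one.

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-5):
    """Cosine-decay schedule: base_lr at step 0, decaying to ~0 at the
    final step (no warmup; an assumption, as the paper specifies none)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

total_steps = 9_000          # illustrative, not taken from the paper
s0 = total_steps // 3        # PAT milestone: 1/3 of total training steps

print(cosine_lr(0, total_steps))   # 5e-05
print(s0)                          # 3000
```

With these settings, the learning rate starts at 5×10⁻⁵ and decays smoothly to zero, while s0 marks the point one third of the way through training.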