PAT: Pruning-Aware Tuning for Large Language Models

Authors: Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that PAT excels in both performance and efficiency. For example, our Llama2-7b model with a 25% pruning ratio achieves a 1.33× speedup while outperforming the LoRA-finetuned model by up to 1.26% in accuracy at a similar training cost. In this section, we present the experimental results and analysis. We begin by describing the experimental setup. Next, we showcase our main results across various Large Language Models (LLMs). We then delve into the efficiency-accuracy trade-off, examining memory and latency considerations. Finally, we conduct ablation studies on the trainable mask and identity loss.
Researcher Affiliation | Collaboration | 1 School of Electronic Science and Engineering, Nanjing University; 2 University of Arizona; 3 Samsung Electronic Research Centre of China; 4 Interdisciplinary Research Center for Future Intelligent Chips, Nanjing University, Suzhou. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology using mathematical formulations (Equations 1-12) and architectural diagrams (Figure 2), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/kriskrisliu/PAT
Open Datasets | Yes | We employ the LaMini-instruction dataset (Wu et al. 2023) for fine-tuning. ... We conduct zero-shot evaluation on 14 datasets, including ARC-Challenge (Clark et al. 2018), ARC-Easy (Clark et al. 2018), BoolQ (Wang et al. 2019a), COPA (Wang et al. 2019a), HellaSwag (Zellers et al. 2019), MMLU (Hendrycks et al. 2021), MultiRC (Wang et al. 2019a), OpenBookQA (Mihaylov et al. 2018), PIQA (Bisk et al. 2020), RTE (Wang et al. 2019a), SIQA (Sap et al. 2019), WiC (Wang et al. 2019a), WinoGrande (Sakaguchi et al. 2021), WSC (Wang et al. 2019a).
Dataset Splits | Yes | We employ the LaMini-instruction dataset (Wu et al. 2023) for fine-tuning. To reduce training costs, we randomly drop 50% of the samples, resulting in a final dataset of 1 million samples. Unless otherwise stated, all experimental results are based on this setting. We conduct zero-shot evaluation on 14 datasets...
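The 50% random drop described above can be sketched as a simple subsampling step. This is a minimal illustration, not the authors' code: the function name and the fixed seed are assumptions, and the demo list stands in for the actual LaMini-instruction samples.

```python
import random

def subsample_half(dataset, seed=0):
    """Randomly keep 50% of the samples (seed is a hypothetical choice;
    the paper does not report one)."""
    rng = random.Random(seed)
    return rng.sample(dataset, len(dataset) // 2)

# Illustrative stand-in for the instruction corpus:
demo = list(range(2_000))
kept = subsample_half(demo)
print(len(kept))  # 1000
```

Applied to the full LaMini-instruction corpus, this yields the roughly 1 million training samples the paper reports.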
Hardware Specification | Yes | Experiments are conducted using A100 GPUs. ... The base Llama2 13B model encounters Out-Of-Memory (OOM) errors at a batch size larger than 288 when executed on a single A100-80GB GPU.
Software Dependencies | No | The paper mentions using "model frameworks and checkpoints from Hugging Face (Jain 2022; Wolf et al. 2019)" but does not provide specific version numbers for Hugging Face libraries, Python, PyTorch, CUDA, or other critical software components.
Experiment Setup | Yes | The models are fine-tuned over 3 epochs using the Alpaca instruction template. The learning rate is set to 5×10⁻⁵ with a cosine schedule. The batch size is set to 128, and the sequence length is 256 tokens. The milestone step of our PAT, s0, is set to 1/3 of the total training steps.
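As a rough sketch of this setup (not the authors' implementation), the cosine learning-rate schedule and the PAT milestone step s0 can be expressed as follows. `total_steps` is an illustrative value, and the schedule assumes no warmup, since the paper does not mention one.

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-5):
    """Cosine-decay schedule: base_lr at step 0, decaying to ~0 at the
    final step (no warmup; an assumption, as the paper specifies none)."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

total_steps = 9_000          # illustrative, not taken from the paper
s0 = total_steps // 3        # PAT milestone: 1/3 of total training steps

print(cosine_lr(0, total_steps))   # 5e-05
print(s0)                          # 3000
```

With these settings, the learning rate starts at 5×10⁻⁵ and decays smoothly to zero, while s0 marks the point one third of the way through training.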