Prompt-based Depth Pruning of Large Language Models
Authors: Juyun Wee, Minjae Park, Jaeho Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines. ... Empirically, we find that the proposed PuDDing enjoys a clear advantage over static depth pruning algorithms, achieving more than 4%p accuracy increase on zero-shot commonsense reasoning tasks (Section 6). |
| Researcher Affiliation | Academia | Juyun Wee*¹, Minjae Park*¹, Jaeho Lee¹. ¹POSTECH. Correspondence to: Jaeho Lee <EMAIL>. |
| Pseudocode | No | The paper describes the method in Section 5 ('To this end, we propose a training-based method for the prompt-based depth pruning of large language models (Section 5). Our method, coined Prompt-routed Dynamic Depth Pruning (PuDDing), works in two steps.'), but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | Project Page: jwee01.github.io/PuDDing Code: github.com/tada0347/PuDDing |
| Open Datasets | Yes | We evaluate on the test splits of six zero-shot commonsense reasoning tasks: ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2020), and BoolQ (Clark et al., 2019). ... For SLEB, FLAP, and SliceGPT, we have used WikiText-2 (Merity et al., 2022). For Shortened LLaMA, we have used BookCorpus (Zhu et al., 2015). ... For training LoRA weights, we have followed the setup and hyperparameters used for LoRA training in Shortened LLaMA (Kim et al., 2024); we have used the Alpaca dataset (Taori et al., 2023) for training, as in the paper. ... including OpenBookQA (Mihaylov et al., 2018), MathQA (Amini et al., 2019), and MMLU (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We evaluate on the test splits of six zero-shot commonsense reasoning tasks: ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2020), and BoolQ (Clark et al., 2019). ... To generate the candidate omission set for our algorithm, we have used 128 randomly drawn samples from the training splits of five zero-shot commonsense reasoning tasks: ARC-Challenge, ARC-Easy, HellaSwag, PIQA, and WinoGrande. ... For training the router, we have used the full training splits. The BoolQ dataset has been left out in order to evaluate the generalization to unseen sets. |
| Hardware Specification | Yes | We have mainly used NVIDIA RTX 6000 Ada for evaluation and training. In addition, we have used cloud instances of NVIDIA A100 for evaluation. ... Table 7 presents results on edge devices (e.g., Apple M3 Pro), showing consistent speedup. |
| Software Dependencies | No | We use a lightweight transformer-based encoder as our router. More specifically, we insert a single linear layer on pretrained BERT-base (Devlin et al., 2019), and jointly fine-tune it during training. While this router has more parameters (~110M) than typical routers used for dynamic token routing, such as D-LLM (Wang et al., 2024), which uses a 2-layer MLP, the computational cost is bearable as we route only once per prompt. ... The router has been trained with AdamW with learning rate 1e-5, weight decay 0.01, and batch size 32 for 10 epochs, with 500 warm-up steps. The paper mentions specific models like BERT-base and optimizers like AdamW, but does not provide version numbers for general software dependencies such as Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | To generate the candidate omission set for our algorithm, we have used 128 randomly drawn samples from the training splits of five zero-shot commonsense reasoning tasks: ARC-Challenge, ARC-Easy, HellaSwag, PIQA, and WinoGrande. That is, we use a total of 10 omission sets (as we use two different losses). For training the router, we have used the full training splits. The BoolQ dataset has been left out in order to evaluate the generalization to unseen sets. The router has been trained with AdamW with learning rate 1e-5, weight decay 0.01, and batch size 32 for 10 epochs, with 500 warm-up steps. ... For PuDDing, we have pruned seven layers (21.88% sparsity). |
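The two-step structure reported above (route a prompt to one of the candidate omission sets, then run the model with those layers skipped, pruning seven layers at 21.88% sparsity) can be illustrated with a minimal self-contained sketch. The 32-layer backbone, the layer indices in the omission sets, and the keyword routing rule below are illustrative assumptions only; in the paper, routing is done by a fine-tuned BERT-base router and the omission sets are derived from calibration losses, not hand-picked.

```python
NUM_LAYERS = 32    # assumed LLaMA-style depth; 7/32 matches the reported 21.88%
PRUNED = 7         # the paper prunes seven layers per omission set

# Step 1 (offline): candidate omission sets. Indices here are made up for
# illustration; the paper builds 10 such sets from 128 calibration samples
# per task under two loss variants.
OMISSION_SETS = {
    "arc":  {21, 22, 24, 25, 27, 28, 30},
    "piqa": {20, 22, 23, 26, 27, 29, 31},
}

def route(prompt: str) -> str:
    """Toy stand-in for the trained router: pick the omission set whose
    removal least hurts this prompt (here, a simple keyword rule)."""
    return "piqa" if "physical" in prompt.lower() else "arc"

def active_layers(prompt: str) -> list[int]:
    """Step 2 (online): run the LLM with the routed layers skipped."""
    skip = OMISSION_SETS[route(prompt)]
    return [i for i in range(NUM_LAYERS) if i not in skip]

sparsity = PRUNED / NUM_LAYERS
print(f"{sparsity:.2%}")                      # 21.88%
print(len(active_layers("a physical task")))  # 25 layers remain
```

Because routing happens once per prompt (not per token), the router's forward pass is amortized over the whole generation, which is why the paper argues a ~110M-parameter router is acceptable.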