Prompt-based Depth Pruning of Large Language Models
Authors: Juyun Wee, Minjae Park, Jaeho Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines. ... Empirically, we find that the proposed PuDDing enjoys a clear advantage over static depth pruning algorithms, achieving more than 4%p accuracy increase on zero-shot commonsense reasoning tasks (Section 6). |
| Researcher Affiliation | Academia | Juyun Wee*¹, Minjae Park*¹, Jaeho Lee¹. ¹POSTECH. Correspondence to: Jaeho Lee <EMAIL>. |
| Pseudocode | No | The paper describes the method in Section 5 ('To this end, we propose a training-based method for the prompt-based depth pruning of large language models (Section 5). Our method, coined Prompt-routed Dynamic Depth Pruning (PuDDing), works in two steps.'), but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | Project Page: jwee01.github.io/PuDDing Code: github.com/tada0347/PuDDing |
| Open Datasets | Yes | We evaluate on the test splits of six zero-shot commonsense reasoning tasks: ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2020), and BoolQ (Clark et al., 2019). ... For SLEB, FLAP, and SliceGPT, we have used WikiText-2 (Merity et al., 2022). For Shortened LLaMA, we have used BookCorpus (Zhu et al., 2015). ... For training LoRA weights, we have followed the setup and hyperparameters used for LoRA training in Shortened LLaMA (Kim et al., 2024); we have used the Alpaca dataset (Taori et al., 2023) for training, as in the paper. ... including OpenBookQA (Mihaylov et al., 2018), MathQA (Amini et al., 2019), and MMLU (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We evaluate on the test splits of six zero-shot commonsense reasoning tasks: ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2020), and BoolQ (Clark et al., 2019). ... To generate the candidate omission set for our algorithm, we have used 128 randomly drawn samples from the training splits of five zero-shot commonsense reasoning tasks: ARC-Challenge, ARC-Easy, HellaSwag, PIQA, and WinoGrande. ... For training the router, we have used the full training splits. The BoolQ dataset has been left out in order to evaluate the generalization to unseen sets. |
| Hardware Specification | Yes | We have mainly used NVIDIA RTX 6000 Ada for evaluation and training. In addition, we have used cloud instances of NVIDIA A100 for evaluation. ... Table 7 presents results on edge devices (e.g., Apple M3 Pro), showing consistent speedup. |
| Software Dependencies | No | We use a lightweight transformer-based encoder as our router. More specifically, we insert a single linear layer on pretrained BERT-base (Devlin et al., 2019), and jointly fine-tune it during training. While this router has more parameters (~110M) than typical routers used for dynamic token routing, such as D-LLM (Wang et al., 2024), which uses a 2-layer MLP, the computational cost is bearable as we route only once per prompt. ... The router has been trained with AdamW with learning rate 1e-5, weight decay 0.01, and batch size 32 for 10 epochs, with 500 warm-up steps. The paper mentions specific models like BERT-base and optimizers like AdamW, but does not provide version numbers for general software dependencies such as Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | To generate the candidate omission set for our algorithm, we have used 128 randomly drawn samples from the training splits of five zero-shot commonsense reasoning tasks: ARC-Challenge, ARC-Easy, HellaSwag, PIQA, and WinoGrande. That is, we use a total of 10 omission sets (as we use two different losses). For training the router, we have used the full training splits. The BoolQ dataset has been left out in order to evaluate the generalization to unseen sets. The router has been trained with AdamW with learning rate 1e-5, weight decay 0.01, and batch size 32 for 10 epochs, with 500 warm-up steps. ... For PuDDing, we have pruned seven layers (21.88% sparsity). |
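The two-step structure reported above (route a prompt to one of the candidate omission sets, then run the model with those layers skipped, pruning seven layers at 21.88% sparsity) can be illustrated with a minimal self-contained sketch. The 32-layer backbone, the layer indices in the omission sets, and the keyword routing rule below are illustrative assumptions only; in the paper, routing is done by a fine-tuned BERT-base router and the omission sets are derived from calibration losses, not hand-picked.

```python
NUM_LAYERS = 32    # assumed LLaMA-style depth; 7/32 matches the reported 21.88%
PRUNED = 7         # the paper prunes seven layers per omission set

# Step 1 (offline): candidate omission sets. Indices here are made up for
# illustration; the paper builds 10 such sets from 128 calibration samples
# per task under two loss variants.
OMISSION_SETS = {
    "arc":  {21, 22, 24, 25, 27, 28, 30},
    "piqa": {20, 22, 23, 26, 27, 29, 31},
}

def route(prompt: str) -> str:
    """Toy stand-in for the trained router: pick the omission set whose
    removal least hurts this prompt (here, a simple keyword rule)."""
    return "piqa" if "physical" in prompt.lower() else "arc"

def active_layers(prompt: str) -> list[int]:
    """Step 2 (online): run the LLM with the routed layers skipped."""
    skip = OMISSION_SETS[route(prompt)]
    return [i for i in range(NUM_LAYERS) if i not in skip]

sparsity = PRUNED / NUM_LAYERS
print(f"{sparsity:.2%}")                      # 21.88%
print(len(active_layers("a physical task")))  # 25 layers remain
```

Because routing happens once per prompt (not per token), the router's forward pass is amortized over the whole generation, which is why the paper argues a ~110M-parameter router is acceptable.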