Adaptive Pruning of Pretrained Transformer via Differential Inclusions

Authors: Yizhuo Ding, Ke Fan, Yikai Wang, Xinwei Sun, Yanwei Fu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on various well-known transformer backbones have demonstrated the efficacy of SPP. Our code is available at https://github.com/yizhuoDi/Solution-Path-Pruning.
Researcher Affiliation | Academia | Yizhuo Ding, Ke Fan, Yikai Wang*, Xinwei Sun*, Yanwei Fu; School of Data Science, Fudan University. EMAIL; EMAIL; EMAIL. *Corresponding authors. Prof. Yanwei Fu is also with the Fudan ISTBI–ZJNU Algorithm Centre for Brain-Inspired Intelligence, Zhejiang Normal University, Jinhua, China.
Pseudocode | Yes | Algorithm 1: Transformer Weight Family; Algorithm 2: Extension to LLMs
Open Source Code | Yes | Our code is available at https://github.com/yizhuoDi/Solution-Path-Pruning.
Open Datasets | Yes | In this paper, we applied SPP to the classification dataset ImageNet (Deng et al., 2009) using the DeiT (Touvron et al., 2021a) backbone. Furthermore, we also extended the method to image and text retrieval datasets, using CLIP models (Radford et al., 2021). ... We used 4 A100 GPUs with 80GB of memory for these experiments. The search stage contains an update stage and a prune stage, both of which need to run only once per model. All fine-tuning stages used AdamW as the optimizer and a cosine scheduler for the learning rate. The hyperparameters are listed in Table 6 and Table 8. ... We applied our method to Llama2-7B and OPT-6.7B. The calibration datasets C4 and WikiText2 were used to generate activations during the forward pass, which, along with weight magnitude, served as the pruning metric. The results were reported on 5 datasets.
Dataset Splits | No | The paper mentions datasets like ImageNet-1k, COCO, CIFAR-10, C4, and WikiText2, but does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., exact percentages, sample counts, or explicit references to standard split methodologies with citations).
Hardware Specification | Yes | We used 4 A100 GPUs with 80GB of memory for these experiments.
Software Dependencies | No | The paper mentions using "AdamW as the optimizer and cosine scheduler as the learning rate scheduler" but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | The hyperparameters are listed in Table 6 and Table 8. Table 6: The hyperparameters of the experiments mentioned above.

Model      | Dataset     | Updating Epochs | Pruning Epochs | Finetuning Epochs | Batch Size | LR
DeiT-Base  | ImageNet-1k | 5               | 20             | 300               | 1024       | 8e-4
DeiT-Small | ImageNet-1k | 10              | 30             | 300               | 512        | 8e-4
DeiT-Tiny  | ImageNet-1k | 10              | 30             | 300               | 256        | 8e-4
Swin-Tiny  | ImageNet-1k | 5               | 20             | 300               | 256        | 8e-4
CLIP-Large | COCO        | 1               | 5              | 5                 | 32         | 1e-5
CLIP-Base  | COCO        | 3               | 5              | 5                 | 32         | 1e-5
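The fine-tuning recipe reported above (AdamW with a cosine learning-rate scheduler, using the per-model epochs and LR from Table 6) can be sketched as follows. This is a minimal illustration, assuming a standard cosine-annealing rule like PyTorch's `CosineAnnealingLR`; the paper does not state warmup or minimum-LR settings, so those details are assumptions.

```python
import math

def cosine_lr(base_lr: float, epoch: int, total_epochs: int, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given epoch.

    Assumption: the standard schedule lr = min_lr + 0.5*(base_lr - min_lr)
    * (1 + cos(pi * epoch / total_epochs)); the paper only says "cosine
    scheduler" without further specifics.
    """
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))

# DeiT-Base row of Table 6: 300 fine-tuning epochs, base LR 8e-4.
base_lr, total_epochs = 8e-4, 300

print(cosine_lr(base_lr, 0, total_epochs))    # start of fine-tuning: full base LR
print(cosine_lr(base_lr, 150, total_epochs))  # halfway: roughly half the base LR
print(cosine_lr(base_lr, 300, total_epochs))  # final epoch: annealed down to min_lr
```

Under this schedule the learning rate decays smoothly from 8e-4 to min_lr over the 300 fine-tuning epochs; in an actual training loop the scheduler would be stepped once per epoch after the optimizer update.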