Instruction-Following Pruning for Large Language Models

Authors: Bairu Hou, Qibin Chen, Jianyu Wang, Guoli Yin, Chong Wang, Nan Du, Ruoming Pang, Shiyu Chang, Tao Lei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model."
Researcher Affiliation | Collaboration | "Work done while interning at Apple. 1 Apple AI/ML, 2 UC Santa Barbara. Correspondence to: Bairu Hou <EMAIL>, Tao Lei <EMAIL>."
Pseudocode | No | The paper describes the architecture and training method in text and equations (Sections 3.1, 3.2, and 3.3) but does not provide a formal pseudocode or algorithm block.
Open Source Code | No | The paper states: "We use the AXLearn (Apple, 2023) framework and JAX (Bradbury et al., 2018) for model training." The URL provided (https://github.com/apple/axlearn) is for the AXLearn framework, not for the specific implementation of IFPRUNING described in this paper. There is no explicit statement or link indicating that the source code for IFPRUNING is made available.
Open Datasets | Yes | "We also follow the setup used in TULU 2 and sample an additional 800K examples from the FLAN-V2 collection (Chung et al., 2024) to enhance task prompt diversity. ... We include IFEval (Zhou et al., 2023), AlpacaEval 2.0 (Dubois et al., 2024), and Arena-Hard-Auto (Li et al., 2024) for evaluation. We evaluate the pass@1 performance on HumanEval-Python (Chen et al., 2021), MBPP (Austin et al., 2021), and MultiPL-E (Cassano et al., 2022)."
Dataset Splits | No | "Our models are trained on an internal SFT dataset with several million examples. We also follow the setup used in TULU 2 and sample an additional 800K examples from the FLAN-V2 collection (Chung et al., 2024) to enhance task prompt diversity." The paper describes the sources of the training and evaluation data but does not specify explicit training/validation/test splits for the internal dataset, explain how the sampled FLAN-V2 data was partitioned, or state whether standard splits were used for the evaluation benchmarks.
Hardware Specification | Yes | "We evaluate the latency on a single NVIDIA RTX A6000 GPU and report time-to-first-token (TTFT) and decoding time with input length = 4k, generation length = 100, and sample 4 responses for each query."
Software Dependencies | Yes | "We use the AXLearn (Apple, 2023) framework and JAX (Bradbury et al., 2018) for model training." The citation for JAX explicitly states "JAX: composable transformations of Python+NumPy programs, v0.3.13, 2018."
Experiment Setup | Yes | "All models are pretrained with a batch size of 2048 and a total number of 5T tokens, except that the DENSE-3B is trained for 9T tokens. The SFT training for the baselines and our method is performed with a batch size of 1024 for 60k training steps."
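The reported setup numbers can be sanity-checked with simple arithmetic. The sketch below derives the implied pretraining step count and the number of SFT examples consumed; the batch sizes, token budget, and step count come from the paper, while the 4096-token context length is an assumption made here purely for illustration (the excerpt does not state it).

```python
# Back-of-the-envelope checks on the reported training setup.
# From the paper: pretraining batch size 2048, 5T pretraining tokens,
# SFT batch size 1024, 60k SFT steps.
# ASSUMPTION (not stated in the excerpt): a 4096-token context length.
pretrain_batch = 2048
seq_len = 4096                   # assumed context length
total_tokens = 5e12              # 5T pretraining tokens

tokens_per_step = pretrain_batch * seq_len      # 8,388,608 tokens per step
pretrain_steps = total_tokens / tokens_per_step

sft_batch = 1024
sft_steps = 60_000
sft_examples = sft_batch * sft_steps            # examples consumed during SFT

print(f"~{pretrain_steps:,.0f} pretraining steps")  # ~596,046 under the assumed seq_len
print(f"{sft_examples:,} SFT examples seen")        # 61,440,000
```

Under these assumptions, 5T tokens correspond to roughly 600k pretraining steps, and 60k SFT steps at batch size 1024 imply about 61M example presentations, consistent with an SFT corpus of several million examples seen over multiple epochs.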