Streamlining Redundant Layers to Compress Large Language Models

Authors: Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at this repository.
Researcher Affiliation | Academia | 1 Engineering Research Center of Database and Business Intelligence, MOE, China; 2 School of Information, Renmin University of China, Beijing, China; 3 Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China; 4 Zhongguancun Laboratory, China
Pseudocode | No | The paper describes the workflow of LLM-Streamline in Section 2, detailing layer pruning and layer replacement, but does not present it in a structured pseudocode or algorithm block.
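Although the paper gives no algorithm block, the layer-pruning half of the workflow can be sketched from its description: score each window of consecutive layers by how little it changes the hidden states (via cosine similarity), and prune the most redundant window. This is a minimal sketch; the function name, tensor shapes, and aggregation are assumptions, not the authors' released implementation.

```python
import numpy as np

def most_redundant_window(hidden_states, n_prune):
    """Pick n_prune consecutive layers to prune.

    hidden_states: list of per-layer-boundary activations, each of shape
    (batch, seq, dim). Each window of n_prune layers is scored by the mean
    cosine similarity between the hidden states entering and leaving it;
    a sketch of the similarity-based criterion, details may differ from
    the paper's code.
    """
    scores = []
    for i in range(len(hidden_states) - n_prune):
        a = hidden_states[i].reshape(-1, hidden_states[i].shape[-1])
        b = hidden_states[i + n_prune].reshape(-1, hidden_states[i + n_prune].shape[-1])
        cos = np.sum(a * b, axis=-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
        scores.append(cos.mean())
    # the window whose output most resembles its input is the most redundant
    return int(np.argmax(scores))
```

In LLM-Streamline, the pruned window would then be replaced by a single lightweight network (an FFN or Transformer layer) trained to mimic the removed layers.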
Open Source Code | Yes | Our code is available at this repository.
Open Datasets | Yes | We conduct experiments on 12 well-known classification benchmarks and 3 generation benchmarks. Our results show that for an LLM with 7B or 13B parameters and a 25% pruning rate, we can maintain 93% performance in classification tasks and 77% in generation tasks without requiring a lot of training data, outperforming existing SOTA pruning methods.
Dataset Splits | Yes | We randomly sample the data based on the distribution used by Sheared LLaMA (Xia et al., 2023), finally constructing the dataset containing 30,000 pieces of data. We randomly select 500 samples from this dataset and input them into LLMs, generating Fig. 2, and use these 500 data samples for layer pruning. All 30,000 pieces of data are used to train the lightweight network.
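The split described above (30,000 sampled examples for training the lightweight network, with a 500-example subset reused to probe layer redundancy) can be sketched as follows; the function and variable names are illustrative, not from the paper.

```python
import random

def build_splits(corpus, seed=0, train_size=30_000, probe_size=500):
    """Sketch of the data split described in the paper: sample 30k
    examples for training the lightweight network, and reuse a random
    500-example subset of them for the layer-pruning statistics."""
    rng = random.Random(seed)
    train = rng.sample(corpus, train_size)
    probe = rng.sample(train, probe_size)  # subset of the training pool
    return train, probe
```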
Hardware Specification | Yes | On a single A800 GPU, the training duration for the lightweight network is approximately 5 hours (for the Transformer layer).
Software Dependencies | No | The paper mentions using language models and training processes but does not specify software dependencies with version numbers, such as PyTorch, TensorFlow, or Python versions.
Experiment Setup | Yes | For both the FFN structure and the SwiGLU structure, the learning rate is set to 1e-3 and the weight decay is 1e-4. For the Transformer layer, the learning rate is set to 1e-5 and the weight decay is 1e-3. The model is trained using a batch size of 32 over 20 epochs. ... For layer replacement, in order to have a fairer comparison with LoRA, we conduct one epoch of post-training with a learning rate of 5e-5, a weight decay of 1e-3, and a batch size of 32. ... For LoRA, we set the rank to 128
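The quoted hyperparameters can be collected into a single helper for reference. The paper does not name its optimizer, so AdamW here is an assumption, as are the function name and the `kind` switch; only the learning rates, weight decays, batch size, and epoch counts come from the quote above.

```python
import torch

def make_optimizer(module, kind):
    """Optimizer settings quoted in the paper's experiment setup.
    AdamW is an assumed choice; the paper does not state the optimizer."""
    if kind in ("ffn", "swiglu"):        # lightweight FFN / SwiGLU replacement
        lr, wd = 1e-3, 1e-4
    elif kind == "transformer_layer":    # single Transformer-layer replacement
        lr, wd = 1e-5, 1e-3
    elif kind == "post_training":        # one-epoch post-training (vs. LoRA, rank 128)
        lr, wd = 5e-5, 1e-3
    else:
        raise ValueError(f"unknown setup: {kind}")
    return torch.optim.AdamW(module.parameters(), lr=lr, weight_decay=wd)

# All settings use batch size 32; the replacement networks train for 20 epochs,
# while post-training runs for a single epoch.
```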