Streamlining Redundant Layers to Compress Large Language Models
Authors: Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at this repository. |
| Researcher Affiliation | Academia | 1 Engineering Research Center of Database and Business Intelligence, MOE, China 2 School of Information, Renmin University of China, Beijing, China 3 Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China 4 Zhongguancun Laboratory, China |
| Pseudocode | No | The paper describes the workflow of LLM-Streamline in Section 2, detailing layer pruning and layer replacement, but does not present it in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at this repository. |
| Open Datasets | Yes | We conduct experiments on 12 well-known classification benchmarks and 3 generation benchmarks. Our results show that for an LLM with 7B or 13B parameters and a 25% pruning rate, we can maintain 93% performance in classification tasks and 77% in generation tasks without requiring a lot of training data, outperforming existing SOTA pruning methods. |
| Dataset Splits | Yes | We randomly sample the data based on the distribution used by Sheared LLaMA (Xia et al., 2023), finally constructing the dataset containing 30,000 pieces of data. We randomly select 500 samples from this dataset and input them into LLMs, generating Fig. 2, and use these 500 data samples for layer pruning. All 30,000 pieces of data are used to train the lightweight network. |
| Hardware Specification | Yes | On a single A800 GPU, the training duration for the lightweight network is approximately 5 hours (for the Transformer layer). |
| Software Dependencies | No | The paper mentions using language models and training processes but does not specify software dependencies with version numbers, such as PyTorch, TensorFlow, or Python versions. |
| Experiment Setup | Yes | For both the FFN structure and the SwiGLU structure, the learning rate is set to 1e-3 and the weight decay is 1e-4. For the Transformer layer, the learning rate is set to 1e-5 and the weight decay is 1e-3. The model is trained using a batch size of 32 over 20 epochs. ... For layer replacement, in order to have a fairer comparison with LoRA, we conduct one epoch of post-training with a learning rate of 5e-5, a weight decay of 1e-3, and a batch size of 32. ... For LoRA, we set the rank to 128. |
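The Experiment Setup row reports distinct hyperparameters for each lightweight replacement structure. A minimal sketch of those reported values as a lookup helper is below; the dictionary keys, function name, and overall structure are illustrative assumptions, not taken from the authors' code, while the numeric values come directly from the quoted setup.

```python
# Hedged sketch: the reported training hyperparameters for the lightweight
# replacement network in LLM-Streamline. Numeric values are quoted from the
# paper; names and structure here are hypothetical for illustration only.

REPLACEMENT_CONFIGS = {
    # FFN and SwiGLU replacement structures share one setting.
    "ffn":         {"lr": 1e-3, "weight_decay": 1e-4},
    "swiglu":      {"lr": 1e-3, "weight_decay": 1e-4},
    # Using a full Transformer layer as the replacement uses a smaller lr.
    "transformer": {"lr": 1e-5, "weight_decay": 1e-3},
}

# Shared across all replacement structures, per the quoted setup.
COMMON = {"batch_size": 32, "epochs": 20}

def training_config(structure: str) -> dict:
    """Return the reported hyperparameters for a given replacement structure."""
    cfg = dict(REPLACEMENT_CONFIGS[structure])  # copy so COMMON doesn't mutate the table
    cfg.update(COMMON)
    return cfg
```

For example, `training_config("transformer")` yields the smaller learning rate (1e-5) with the shared batch size of 32 and 20 epochs, matching the quoted description.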