Liger: Linearizing Large Language Models to Gated Recurrent Structures
Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct extensive experiments to answer the following research questions (RQ): RQ1: Can Liger linearize the pre-trained LLMs and recover performance more effectively compared with other linearization methods? ... We conducted experiments to compare efficiency in terms of decoding latency and GPU memory consumption of Llama-3-8B without (w/o.) Flash-Attention-2 (FA2), Llama-3-8B with (w/.) Flash-Attention-2 (FA2) and Liger-GLA-8B on a single A800 80GB GPU. |
| Researcher Affiliation | Collaboration | 1Shanghai AI Laboratory 2South China University of Technology 3The Hong Kong University of Science and Technology (Guangzhou) 4Nanjing University 5The Chinese University of Hong Kong. * Interns at Shanghai AI Laboratory. Corresponding Authors: Weigao Sun <EMAIL>, Yu Cheng <EMAIL>. |
| Pseudocode | No | The paper only presents mathematical formulations (Eq. 1-11) and describes procedures in narrative text; it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/OpenSparseLLMs/Linearization and the models are available at https://huggingface.co/collections/linear-moe-hub. |
| Open Datasets | Yes | We use 50,000 high quality instruction samples of cleaned Alpaca dataset (Taori et al., 2023) during linearization process to improve instruction-following ability and recover LLM performance in language modeling tasks. ... including PiQA (Bisk et al., 2020), ARC-easy (ARC-e), ARC-challenge (ARC-c) (Clark et al., 2018), HellaSwag (Hella.) (Zellers et al., 2019), WinoGrande (Wino.) (Sakaguchi et al., 2019) and MMLU (Li et al., 2023). |
| Dataset Splits | No | The paper states 'We use 50,000 high quality instruction samples of cleaned Alpaca dataset (Taori et al., 2023) during linearization process' and 'the finetuning epochs is 2, which means we only use 100,000 cleaned Alpaca instruction samples (around 0.02B tokens) for gated recurrent model linearization.' This indicates the total amount of data used but does not provide specific training/validation/test splits for their fine-tuning process. For evaluation, it refers to standard benchmarks without explicitly detailing the splits used for those benchmarks within the paper itself. |
| Hardware Specification | Yes | All experiments are implemented in PyTorch and conducted on a single NVIDIA A800 80GB GPU. |
| Software Dependencies | No | All experiments are implemented in PyTorch and conducted on a single NVIDIA A800 80GB GPU. The paper only mentions PyTorch without a specific version number and does not list any other software dependencies with version numbers. |
| Experiment Setup | Yes | We opt for the AdamW optimizer with a learning rate of 1e-3. By default, the LoRA rank is set to 8 and alpha is set to 8. The finetuning epochs is 2, which means we only use 100,000 cleaned Alpaca instruction samples (around 0.02B tokens) for gated recurrent model linearization. We pad the input sequence to 1024 tokens with mini batch size of 1, and set the global batch size to 8 by gradient accumulation, following the settings in LoLCATs (Zhang et al., 2024a). |
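The experiment-setup row above can be summarized as a configuration sketch. This is a minimal, hedged illustration assembled only from the hyperparameters quoted from the paper (the key names in the `config` dict are illustrative; the authors' actual training script lives in the linked repository and may differ):

```python
# Hedged sketch of the linearization fine-tuning setup reported in the paper.
# Key names are illustrative, not taken from the authors' code.
config = {
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "lora_rank": 8,
    "lora_alpha": 8,
    "epochs": 2,
    "max_seq_len": 1024,        # inputs padded to 1024 tokens
    "micro_batch_size": 1,      # per-step mini batch
    "global_batch_size": 8,     # reached via gradient accumulation
}

# Gradient accumulation bridges the micro batch and the global batch:
grad_accum_steps = config["global_batch_size"] // config["micro_batch_size"]

# Token budget: 2 epochs over the 50,000 cleaned Alpaca samples
# gives the 100,000 instruction samples (~0.02B tokens) quoted above.
samples_seen = 50_000 * config["epochs"]

print(grad_accum_steps, samples_seen)
```

The arithmetic confirms the paper's own accounting: 8 accumulation steps per optimizer update, and 100,000 samples seen across the two epochs.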