Liger: Linearizing Large Language Models to Gated Recurrent Structures
Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct extensive experiments to answer the following research questions (RQ): RQ1: Can Liger linearize the pre-trained LLMs and recover performance more effectively compared with other linearization methods? ... We conducted experiments to compare efficiency in terms of decoding latency and GPU memory consumption of Llama-3-8B without (w/o.) Flash-Attention-2 (FA2), Llama-3-8B with (w/.) Flash-Attention-2 (FA2) and Liger-GLA-8B on a single A800 80GB GPU. |
| Researcher Affiliation | Collaboration | 1Shanghai AI Laboratory 2South China University of Technology 3The Hong Kong University of Science and Technology (Guangzhou) 4Nanjing University 5The Chinese University of Hong Kong. * Interns at Shanghai AI Laboratory. Corresponding Authors: Weigao Sun <EMAIL>, Yu Cheng <EMAIL>. |
| Pseudocode | No | The paper only presents mathematical formulations (Eq. 1-11) and describes procedures in narrative text; it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/OpenSparseLLMs/Linearization and the models are available at https://huggingface.co/collections/linear-moe-hub. |
| Open Datasets | Yes | We use 50,000 high quality instruction samples of cleaned Alpaca dataset (Taori et al., 2023) during linearization process to improve instruction-following ability and recover LLM performance in language modeling tasks. ... including PiQA (Bisk et al., 2020), ARC-easy (ARC-e), ARC-challenge (ARC-c) (Clark et al., 2018), HellaSwag (Hella.) (Zellers et al., 2019), WinoGrande (Wino.) (Sakaguchi et al., 2019) and MMLU (Li et al., 2023). |
| Dataset Splits | No | The paper states 'We use 50,000 high quality instruction samples of cleaned Alpaca dataset (Taori et al., 2023) during linearization process' and 'the finetuning epochs is 2, which means we only use 100,000 cleaned Alpaca instruction samples (around 0.02B tokens) for gated recurrent model linearization.' This indicates the total amount of data used but does not provide specific training/validation/test splits for their fine-tuning process. For evaluation, it refers to standard benchmarks without explicitly detailing the splits used for those benchmarks within the paper itself. |
| Hardware Specification | Yes | All experiments are implemented in PyTorch and conducted on a single NVIDIA A800 80GB GPU. |
| Software Dependencies | No | All experiments are implemented in PyTorch and conducted on a single NVIDIA A800 80GB GPU. The paper only mentions PyTorch without a specific version number and does not list any other software dependencies with version numbers. |
| Experiment Setup | Yes | We opt for the AdamW optimizer with a learning rate of 1e-3. By default, the LoRA rank is set to 8 and alpha is set to 8. The finetuning epochs is 2, which means we only use 100,000 cleaned Alpaca instruction samples (around 0.02B tokens) for gated recurrent model linearization. We pad the input sequence to 1024 tokens with mini batch size of 1, and set the global batch size to 8 by gradient accumulation, following the settings in LoLCATs (Zhang et al., 2024a). |
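The experiment-setup row above can be summarized as a configuration sketch. This is a minimal, hedged illustration assembled only from the hyperparameters quoted from the paper (the key names in the `config` dict are illustrative; the authors' actual training script lives in the linked repository and may differ):

```python
# Hedged sketch of the linearization fine-tuning setup reported in the paper.
# Key names are illustrative, not taken from the authors' code.
config = {
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "lora_rank": 8,
    "lora_alpha": 8,
    "epochs": 2,
    "max_seq_len": 1024,        # inputs padded to 1024 tokens
    "micro_batch_size": 1,      # per-step mini batch
    "global_batch_size": 8,     # reached via gradient accumulation
}

# Gradient accumulation bridges the micro batch and the global batch:
grad_accum_steps = config["global_batch_size"] // config["micro_batch_size"]

# Token budget: 2 epochs over the 50,000 cleaned Alpaca samples
# gives the 100,000 instruction samples (~0.02B tokens) quoted above.
samples_seen = 50_000 * config["epochs"]

print(grad_accum_steps, samples_seen)
```

The arithmetic confirms the paper's own accounting: 8 accumulation steps per optimizer update, and 100,000 samples seen across the two epochs.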