LongMamba: Enhancing Mamba's Long-Context Capabilities via Training-Free Receptive Field Enlargement
Authors: Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 EXPERIMENTAL RESULTS In this section, we conduct a comprehensive evaluation of LongMamba across diverse tasks to assess its long-context understanding capabilities. Our evaluation covers three distinct datasets: language modeling (on PG-19 (Rae et al., 2019)), RULER (Hsieh et al., 2024), and LongBench-E (Bai et al., 2023). |
| Researcher Affiliation | Collaboration | Zhifan Ye¹, Kejing Xia¹, Yonggan Fu¹﹐², Xin Dong², Jihoon Hong¹, Xiangchi Yuan¹, Shizhe Diao², Jan Kautz², Pavlo Molchanov², Yingyan (Celine) Lin¹﹐² ¹Georgia Institute of Technology ²NVIDIA |
| Pseudocode | No | The paper describes a two-step pipeline in Section 5 and explains the steps using formal equations and text, but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/GATECH-EIC/LongMamba. |
| Open Datasets | Yes | Our evaluation covers three distinct datasets: language modeling (on PG-19 (Rae et al., 2019)), RULER (Hsieh et al., 2024), and LongBench-E (Bai et al., 2023). ... when constructing the lookup table g(S) used in Sec. 5.2, we first randomly sample 5 sequences from the Pile (Gao et al., 2020) dataset to calibrate the ∆t distribution. |
| Dataset Splits | No | The paper evaluates on well-known datasets and benchmarks (PG-19, RULER, LongBench-E) and mentions generating 100 sequences for RULER, but it does not explicitly specify training/test/validation splits (e.g., percentages, counts, or links to split files) for all experiments. For PG-19, it defers to another paper's settings: "following the settings in (Ben-Kish et al., 2024)". |
| Hardware Specification | Yes | Table 6: Comparison of the prefilling latency on A5000 between the vanilla models (Gu & Dao, 2023; Dao & Gu, 2024a; Glorioso et al., 2024a) and the corresponding LongMamba-enhanced models across various sequence lengths (4k, 8k, 16k, 24k, 32k, and 40k tokens). The batch size for all experiments is set to 1. The unit of all latency measurements is seconds. |
| Software Dependencies | No | The paper does not explicitly provide specific software dependencies (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For the hyperparameter θ, which differentiates the global and local channels in Sec. 5.1, we conduct a hyperparameter search among candidate values {10⁻⁴⁰, 10⁻³⁰, 10⁻²⁰, 10⁻¹⁰, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 5×10⁻², 10⁻¹, 5×10⁻¹} on the LongBench-E dataset and select the θ that yields the highest average accuracy. For each model, we use the same θ across all experiments. Specifically, for Mamba-1.4B, θ is set to 10⁻³⁰, while for Mamba2-1.3B and Zamba2-1.2B, we set θ to 5×10⁻² and 10⁻⁵, respectively. ... To ensure numerical stability, we clamp extreme values of ∆t to the top C% largest values. We search for the optimal C among candidates {0, 5, 10, 15, 20} following the same procedure as for θ. As a result, C is set to 5 for both Mamba2-1.3B and Zamba2-1.2B and 20 for Mamba-1.4B. |
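The clamping step quoted above (capping the top C% largest ∆t values for numerical stability) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `clamp_delta_t` and the use of a percentile cap are assumptions based solely on the description in the Experiment Setup row; the `theta_grid` and `c_grid` values are transcribed from that row.

```python
import numpy as np

def clamp_delta_t(delta_t: np.ndarray, c_percent: float) -> np.ndarray:
    """Clamp the largest C% of delta-t values down to the (100 - C)th percentile.

    Assumed reading of "clamp extreme values of delta-t to the top C% largest
    values": any value above the (100 - C)th percentile is replaced by that
    percentile, so the top C% of values are capped.
    """
    if c_percent <= 0:
        # C = 0 means no clamping at all
        return delta_t
    cap = np.percentile(delta_t, 100.0 - c_percent)
    return np.minimum(delta_t, cap)

# Candidate grids quoted from the paper's hyperparameter search
theta_grid = [1e-40, 1e-30, 1e-20, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2, 5e-2, 1e-1, 5e-1]
c_grid = [0, 5, 10, 15, 20]
```

Under this reading, the grid search would evaluate each (θ, C) pair on LongBench-E and keep the combination with the highest average accuracy, fixing it per model thereafter.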