LongMamba: Enhancing Mamba's Long-Context Capabilities via Training-Free Receptive Field Enlargement
Authors: Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 EXPERIMENTAL RESULTS In this section, we conduct a comprehensive evaluation of LongMamba across diverse tasks to assess its long-context understanding capabilities. Our evaluation covers three distinct datasets: language modeling (on PG-19 (Rae et al., 2019)), RULER (Hsieh et al., 2024), and LongBench-E (Bai et al., 2023). |
| Researcher Affiliation | Collaboration | Zhifan Ye¹, Kejing Xia¹, Yonggan Fu¹﹐², Xin Dong², Jihoon Hong¹, Xiangchi Yuan¹, Shizhe Diao², Jan Kautz², Pavlo Molchanov², Yingyan (Celine) Lin¹﹐² ¹Georgia Institute of Technology ²NVIDIA |
| Pseudocode | No | The paper describes a two-step pipeline in Section 5 and explains the steps using formal equations and text, but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/GATECH-EIC/LongMamba. |
| Open Datasets | Yes | Our evaluation covers three distinct datasets: language modeling (on PG-19 (Rae et al., 2019)), RULER (Hsieh et al., 2024), and LongBench-E (Bai et al., 2023). ... when constructing the lookup table g(S) used in Sec. 5.2, we first randomly sample 5 sequences from the Pile (Gao et al., 2020) dataset to calibrate the ∆t distribution. |
| Dataset Splits | No | The paper evaluates on well-known datasets and benchmarks (PG-19, RULER, LongBench-E) and mentions generating 100 sequences for RULER, but it does not explicitly specify training/test/validation splits (e.g., percentages, counts, or links to split files) for all experiments. For PG-19, it defers to another paper's settings: "following the settings in (Ben-Kish et al., 2024)". |
| Hardware Specification | Yes | Table 6: Comparison of the prefilling latency on A5000 between the vanilla models (Gu & Dao, 2023; Dao & Gu, 2024a; Glorioso et al., 2024a) and the corresponding LongMamba-enhanced models across various sequence lengths (4k, 8k, 16k, 24k, 32k, and 40k tokens). The batch size for all experiments is set to 1. The unit of all latency measurements is seconds. |
| Software Dependencies | No | The paper does not explicitly provide specific software dependencies (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For the hyperparameter θ, which differentiates the global and local channels in Sec. 5.1, we conduct a hyperparameter search among candidate values {10⁻⁴⁰, 10⁻³⁰, 10⁻²⁰, 10⁻¹⁰, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 5×10⁻², 10⁻¹, 5×10⁻¹} on the LongBench-E dataset and select the θ that yields the highest average accuracy. For each model, we use the same θ across all experiments. Specifically, for Mamba-1.4B, θ is set to 10⁻³⁰, while for Mamba2-1.3B and Zamba2-1.2B, we set θ to 5×10⁻² and 10⁻⁵, respectively. ... To ensure numerical stability, we clamp extreme values of ∆t to the top C% largest values. We search for the optimal C among candidates {0, 5, 10, 15, 20} following the same procedure as for θ. As a result, C is set to 5 for both Mamba2-1.3B and Zamba2-1.2B and 20 for Mamba-1.4B. |
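The clamping step quoted above (capping the top C% largest ∆t values for numerical stability) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `clamp_delta_t` and the use of a percentile cap are assumptions based solely on the description in the Experiment Setup row; the `theta_grid` and `c_grid` values are transcribed from that row.

```python
import numpy as np

def clamp_delta_t(delta_t: np.ndarray, c_percent: float) -> np.ndarray:
    """Clamp the largest C% of delta-t values down to the (100 - C)th percentile.

    Assumed reading of "clamp extreme values of delta-t to the top C% largest
    values": any value above the (100 - C)th percentile is replaced by that
    percentile, so the top C% of values are capped.
    """
    if c_percent <= 0:
        # C = 0 means no clamping at all
        return delta_t
    cap = np.percentile(delta_t, 100.0 - c_percent)
    return np.minimum(delta_t, cap)

# Candidate grids quoted from the paper's hyperparameter search
theta_grid = [1e-40, 1e-30, 1e-20, 1e-10, 1e-5, 1e-4, 1e-3, 1e-2, 5e-2, 1e-1, 5e-1]
c_grid = [0, 5, 10, 15, 20]
```

Under this reading, the grid search would evaluate each (θ, C) pair on LongBench-E and keep the combination with the highest average accuracy, fixing it per model thereafter.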