MambaExtend: A Training-Free Approach to Improve Long Context Extension of Mamba

Authors: Seyedarmin Azizi, Souvik Kundu, Mohammad Sadeghi, Massoud Pedram

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To show the efficacy of MambaExtend, we performed extensive experiments on perplexity evaluation, LongBench, and long-context retrieval tasks with both Mamba and Mamba2 variants. For example, on PG19, via the ZO-based scaling factor update alone, MambaExtend can extend the context length of a pre-trained model from 2k to 64k without incurring any significant perplexity (PPL) increase. This section evaluates the performance and efficiency of our proposed MambaExtend. Specifically, we first describe the models and datasets used for our experiments. We then present extensive empirical results to outline our findings regarding the long-context performance of the Mamba model variants. We finally discuss the compute, time, and memory requirements for MambaExtend.
Researcher Affiliation | Collaboration | University of Southern California, Los Angeles, USA; Intel Labs, USA. Equal contribution authors. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 outlines the MambaExtend framework, which takes as input a pre-trained Mamba model, a small set of calibration samples from the target task, and a specialized function known as the calibration function (CF). Algorithm 2 outlines the CFBP algorithm, and Algorithm 3 the CFZO algorithm.
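The zeroth-order calibration (CFZO) mentioned above can be sketched as a standard two-point, SPSA-style update on the learnable scaling factors: perturb the factors in a random direction, estimate the gradient from two loss evaluations, and take a small step. This is a minimal sketch assuming a Rademacher perturbation and a plain two-point estimator; the exact estimator in the paper's Algorithm 3, and the function name `spsa_step`, are assumptions here.

```python
import numpy as np

def spsa_step(loss_fn, delta, eta=0.001, c=0.1, rng=None):
    """One zeroth-order (SPSA-style) update of the scaling factors `delta`.

    eta is the step size and c the perturbation scale, mirroring the
    hyperparameters reported in the paper; the update rule itself is a
    generic two-point ZO estimate, not necessarily the authors' exact code.
    """
    rng = rng or np.random.default_rng(0)
    u = rng.choice([-1.0, 1.0], size=delta.shape)  # random Rademacher direction
    # Two-point finite-difference estimate of the directional gradient.
    g = (loss_fn(delta + c * u) - loss_fn(delta - c * u)) / (2 * c) * u
    return delta - eta * g
```

Because only two forward passes are needed per step, no backpropagation graph has to be stored, which is what makes the ZO variant attractive for long calibration sequences.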
Open Source Code | Yes | Code and checkpoints are available at https://github.com/ArminAzizi98/LongContextMamba
Open Datasets | Yes | For long-context understanding, we use the Pile (Gao et al., 2020) and PG-19 (Rae et al., 2019) datasets and assess the performance of MambaExtend in terms of perplexity scores at various context lengths. Additionally, we use the LongBench benchmark (Bai et al., 2023) to evaluate the performance accuracy of the Mamba-1.4B and Mamba2-780M models. For the passkey retrieval task, we follow the setup described in Ben-Kish et al. (2024) and evaluate the ability of the Mamba-130M and Mamba-1.4B models to retrieve a 5-digit code embedded at a random sequence depth within samples from the WikiText-103 dataset (Merity et al., 2016).
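The passkey retrieval setup above can be sketched as follows: a 5-digit code is planted at a random depth inside filler text, and the model is asked to recall it. The filler sentence and prompt wording here are illustrative assumptions, not the exact WikiText-103-based templates of Ben-Kish et al. (2024).

```python
import random

def make_passkey_sample(context_len=512, rng=None):
    """Build a toy passkey-retrieval sample: filler text with a 5-digit
    code inserted at a random depth. Returns (prompt, passkey)."""
    rng = rng or random.Random(0)
    passkey = str(rng.randrange(10000, 100000))  # random 5-digit code
    filler = ["The grass is green. The sky is blue."] * context_len
    depth = rng.randrange(len(filler))           # random insertion depth
    filler.insert(depth, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(filler) + " What is the pass key?"
    return prompt, passkey
```

Sweeping `context_len` beyond the model's pre-training context (e.g., 2k tokens) is what probes whether the extension actually preserves retrieval at depth.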
Dataset Splits | Yes | To evaluate perplexity (PPL) on the Pile and PG-19, we use twenty calibration samples from the corresponding training set for a given context length. Due to the lack of training data, we used 10 samples from the 4K-8K split of each dataset as calibration data and the remaining samples from the same split for evaluation.
Hardware Specification | Yes | We used an Nvidia A6000 GPU with 48 GB memory for all the experiments.
Software Dependencies | No | To perform calibration and fine-tuning, we used the PyTorch API to write the corresponding code.
Experiment Setup | Yes | CFZO hyperparameters: for Pile, PG-19, and LongBench calibration, we set the ZO optimization hyperparameters to η = 0.001, c = 0.1, and K = 50. CFBP hyperparameters: for the passkey retrieval task, we train the models for one epoch using the Adam optimizer with a learning rate of 0.1 for MambaExtend. For DeciMamba and full fine-tuning, we use a learning rate of 1e-4, as suggested by the authors (Ben-Kish et al., 2024). For all three cases, we use a batch size of 32, gradient clipping of 1.0, a weight decay of 0.1, and train on a sequence length of 6144.
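The CFBP settings above (Adam, learning rate 0.1, gradient clipping at 1.0, weight decay 0.1) can be sketched as a single backprop-based update on the scaling factors. The Adam/AdamW bookkeeping below is a generic sketch using those reported hyperparameters, not the authors' code; `cfbp_step` is a hypothetical name.

```python
import numpy as np

def cfbp_step(delta, grad, state, lr=0.1, clip=1.0, wd=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One CFBP-style update of the scaling factors `delta` given their
    gradient `grad`. `state` holds the Adam moments {"t", "m", "v"}."""
    norm = np.linalg.norm(grad)
    if norm > clip:                          # clip the gradient norm at 1.0
        grad = grad * (clip / norm)
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected moments
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    delta = delta * (1 - lr * wd)            # decoupled (AdamW-style) decay
    return delta - lr * m_hat / (np.sqrt(v_hat) + eps)
```

Whether the weight decay is coupled or decoupled in the authors' PyTorch setup is not stated; the decoupled form is assumed here for concreteness.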