Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bi-Mamba: Towards Accurate 1-Bit State Space Model

Authors: Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on language modeling benchmarks demonstrate that Bi-Mamba achieves performance comparable to its full-precision (FP16 or BF16) counterparts, while outperforming post-training binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines.
Researcher Affiliation | Academia | Shengkun Tang (Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence); Liqun Ma (Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence); Haonan Li (Department of Natural Language Processing, Mohamed bin Zayed University of Artificial Intelligence); Mingjie Sun (Department of Computer Science, Carnegie Mellon University); Zhiqiang Shen (Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence)
Pseudocode | No | The paper describes mathematical equations and an "Overall Design of Bi-Mamba" section, but these are presented as mathematical formulas and descriptive text, not as structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and the pre-trained weights are available at https://github.com/Tangshengku/Bi-Mamba.
Open Datasets | Yes | Following FBI-LLM, we train Bi-Mamba with the Amber dataset (Liu et al., 2023), which contains a total of 1.26 trillion tokens from RefinedWeb (Penedo et al., 2023), StarCoder (Li et al., 2023a), and RedPajama-v1 (Computer, 2023). ... We also use perplexity on the Wikitext2 (Merity et al., 2016), PTB (Marcus et al., 1993), and C4 (Raffel et al., 2020) datasets as evaluation metrics. ... We further conduct instruction tuning on our base Bi-Mamba model with the OpenOrca (Lian et al., 2023) dataset.
Dataset Splits | No | "The data is partitioned into 360 chunks, each comprising approximately 3.5B tokens on average." This describes the partitioning of the training data, but the paper does not explicitly detail the training, validation, and test splits (e.g., percentages or specific subsets) used for model evaluation across all tasks. Standard evaluation datasets such as Wikitext2, PTB, and C4 are mentioned, but not the specific splits used.
Hardware Specification | Yes | We train models until convergence with 32 NVIDIA A100 GPUs in total and maintain BF16 precision while training. ... We deploy our Bi-Mamba on the CPU of a Mac M4 Pro chip by implementing Bi-Mamba with the llama.cpp framework.
Software Dependencies | No | The paper mentions using the lm-evaluation-harness package, implementing Bi-Mamba with the llama.cpp framework, and utilizing the asitop tool, but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We train Bi-Mamba on different scales with the Mamba-2 architecture. ... Table 2: The configuration and training details for Bi-Mamba. ... The training process uses the Adam optimizer with parameters β1 = 0.9 and β2 = 0.95. The initial learning rate is set at 2.5e-4 and follows a cosine schedule, decreasing to 2.5e-5 over 2,000 warm-up steps. Gradient clipping is set at 1.0. We train Bi-Mamba 780M, 1.3B, and 2.7B with 30 data chunks, which are 105B tokens.
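The Experiment Setup row quotes a cosine learning-rate schedule: peak 2.5e-4, decayed to 2.5e-5, with 2,000 warm-up steps. The following is a minimal sketch of such a schedule; the linear warm-up shape and the total step count in the example are assumptions, not details taken from the paper.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2.5e-4, min_lr=2.5e-5, warmup_steps=2000):
    """Cosine learning-rate schedule with linear warm-up (sketch).

    Reproduces the quoted hyperparameters (peak 2.5e-4, floor 2.5e-5,
    2,000 warm-up steps); the warm-up shape is an assumed linear ramp.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to the peak learning rate (assumption).
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At step 2,000 the schedule returns exactly the peak rate, and at the final step it has decayed to the 2.5e-5 floor.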
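The Research Type row contrasts post-training binarization (PTB) and binarization-aware training (BAT). As a rough illustration of what 1-bit weights mean, here is a minimal sketch of a common weight-binarization scheme (sign of each weight scaled by the tensor's mean absolute value); the report does not quote Bi-Mamba's exact binarization formula, so this particular scheme is an assumption, not the paper's method.

```python
def binarize(weights):
    """Binarize a 2-D weight matrix to {-alpha, +alpha} (sketch).

    alpha is the per-tensor mean absolute value, a widely used scaling
    choice in binarized networks; Bi-Mamba's actual formulation may
    differ (e.g., per-channel scales or learnable parameters).
    """
    flat = [w for row in weights for w in row]
    alpha = sum(abs(w) for w in flat) / len(flat)  # per-tensor scale
    # Each weight keeps only its sign, carrying 1 bit of information.
    return [[alpha if w >= 0 else -alpha for w in row] for row in weights]
```

During binarization-aware training, the forward pass would use the binarized weights while gradients update the latent full-precision weights (typically via a straight-through estimator).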