Jamba: Hybrid Transformer-Mamba Language Models
Authors: Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden Gerber, Elad Dolev, Eran Krakovsky, Erez Sa, Erez Schwartz, Gal Cohen, Gal Shachaf, Haim Rozenblum, Hofit Bata, Ido Blass, Inbal Magar, Itay Dalmedigos, Jhonathan Osin, Julie Fadlon, Maria Rozman, Matan Danos, Michael Gokhman, Mor Zusman, Naama Gidron, Nir Ratner, Noam Gat, Noam Rozen, Oded Fried, Ohad Leshno, Omer Antverg, Omri Abend, Or Dagan, Orit Cohavi, Raz Alon, Ro'i Belson, Roi Cohen, Rom Gilad, Roman Glozman, Shahar Lev, Shai Shalev-Shwartz, Shaked Meirom, Tal Delbari, Tal Ness, Tomer Asida, Tom Ben Gal, Tom Braude, Uriya Pumerantz, Joshua Cohen, Yonatan Belinkov, Yuval Globerson, Yuval Levy, Yoav Shoham
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated the Jamba models on a wide range of benchmarks and found they perform comparably to state-of-the-art open-weight models of a similar or greater number of parameters, while offering much better throughput. Notably, our models support a context length of 256K tokens, the longest supported context length among production-grade publicly available models. |
| Researcher Affiliation | Industry | Models: https://huggingface.co/ai21labs (page 1) |
| Pseudocode | No | The paper describes the architecture and methodology in descriptive text and figures but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The model weights are publicly available. Models: https://huggingface.co/ai21labs (...) We make the Jamba models publicly available under the Jamba Open Model License to support further study, experimentation, and optimization of this novel architecture by the community: Jamba-1.5-Mini: https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini Jamba-1.5-Large: https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large (...) We have contributed our modified fused_moe kernel to vLLM: https://github.com/vllm-project/vllm/pull/7415 |
| Open Datasets | Yes | We evaluate on the RULER benchmark (Hsieh et al., 2024)... Next we evaluate on ∞BENCH (Zhang et al., 2024)... standard academic benchmarks: MMLU (Hendrycks et al., 2020), MMLU-Pro (Wang et al., 2024), GPQA (Rein et al., 2023), ARC-Challenge (Clark et al., 2018), BBH (Suzgun et al., 2023), and HumanEval (Chen et al., 2021). We also evaluate on the IFEval instruction following dataset (Zhou et al., 2023) and the BFCL v1 function calling dataset (Yan et al., 2024). Finally, we report safety evaluations on RealToxicity (Gehman et al., 2020) and TruthfulQA (Lin et al., 2022). |
| Dataset Splits | Yes | We have made sure to use the same evaluation setup in all models compared in this work, including prompts, official splits, and number of shots. When evaluating Jamba and self-reporting results of other models, we always used the official repositories. We mainly used the LM Evaluation Harness (Biderman et al., 2024) whenever possible. |
| Hardware Specification | Yes | Jamba-1.5-Large was trained on NVIDIA H100 GPUs... Jamba-1.5-Mini ... designed to fit on a single 80GB GPU and Jamba-1.5-Large ... a single 8x80GB GPU machine. (...) All measurements are on 2x A100 80GB GPUs, with batch size 1 and output length 512 tokens. (...) All measurements are on 8x A100 80GB GPUs, with batch size 1 and output length 512 tokens. |
| Software Dependencies | No | We have contributed our modified fused_moe kernel to vLLM: https://github.com/vllm-project/vllm/pull/7415. The paper mentions software like vLLM, FSDP, SentencePiece BPE, and the LM Evaluation Harness, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | Concretely, our implementation of Jamba uses a sequence of Jamba blocks (4 blocks in Jamba-1.5-Mini, 9 in Jamba-1.5-Large). Each block has the following configuration: number of layers (l): 8; total number of experts (n): 16; ratio of attention-to-Mamba layers (a : m): 1 : 7; number of top experts used at each token (K): 2; use MoE instead of MLP every e layers: e = 2. For Jamba-1.5-Large, we used α = 10⁻⁵. The pre-training context length was 4K for Jamba-1.5-Mini and 8K for Jamba-1.5-Large. |
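The block configuration in the Experiment Setup row above can be sketched in code. The following is a minimal, hypothetical Python rendering of one Jamba block's layer layout, assuming 8 layers per block with a 1:7 attention-to-Mamba ratio, MoE replacing the MLP every 2 layers, 16 experts, and top-2 routing; the position of the single attention layer within the block (`attn_positions`) is an illustrative assumption, not something this table specifies.

```python
def jamba_block_layout(n_layers=8, attn_positions=(4,), moe_every=2,
                       n_experts=16, top_k=2):
    """Return a list of (mixer, ffn) descriptors for one Jamba block.

    Each layer pairs a token mixer (one attention layer per block, the
    rest Mamba, giving the 1:7 ratio) with a feed-forward component
    (a sparse MoE every `moe_every` layers, a dense MLP otherwise).
    """
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i in attn_positions else "mamba"
        # Every `moe_every`-th layer swaps the dense MLP for a sparse MoE.
        if (i + 1) % moe_every == 0:
            ffn = f"moe(experts={n_experts}, top_k={top_k})"
        else:
            ffn = "mlp"
        layout.append((mixer, ffn))
    return layout

if __name__ == "__main__":
    for i, (mixer, ffn) in enumerate(jamba_block_layout()):
        print(f"layer {i}: {mixer:9s} + {ffn}")
```

Under these assumptions, Jamba-1.5-Mini would stack 4 such blocks and Jamba-1.5-Large 9, so only a small fraction of layers carry attention's quadratic cost, which is consistent with the throughput and long-context claims in the table.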