Jamba: Hybrid Transformer-Mamba Language Models
Authors: Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden Gerber, Elad Dolev, Eran Krakovsky, Erez Sa, Erez Schwartz, Gal Cohen, Gal Shachaf, Haim Rozenblum, Hofit Bata, Ido Blass, Inbal Magar, Itay Dalmedigos, Jhonathan Osin, Julie Fadlon, Maria Rozman, Matan Danos, Michael Gokhman, Mor Zusman, Naama Gidron, Nir Ratner, Noam Gat, Noam Rozen, Oded Fried, Ohad Leshno, Omer Antverg, Omri Abend, Or Dagan, Orit Cohavi, Raz Alon, Ro'i Belson, Roi Cohen, Rom Gilad, Roman Glozman, Shahar Lev, Shai Shalev-Shwartz, Shaked Meirom, Tal Delbari, Tal Ness, Tomer Asida, Tom Ben Gal, Tom Braude, Uriya Pumerantz, Joshua Cohen, Yonatan Belinkov, Yuval Globerson, Yuval Levy, Yoav Shoham
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated the Jamba models on a wide range of benchmarks and found they perform comparably to state-of-the-art open-weight models of a similar or greater number of parameters, while offering much better throughput. Notably, our models support a context length of 256K tokens, the longest supported context length among production-grade publicly available models. |
| Researcher Affiliation | Industry | Models: https://huggingface.co/ai21labs (page 1) |
| Pseudocode | No | The paper describes the architecture and methodology in descriptive text and figures but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The model weights are publicly available. Models: https://huggingface.co/ai21labs (...) We make the Jamba models publicly available under the Jamba Open Model License to support further study, experimentation, and optimization of this novel architecture by the community: Jamba-1.5-Mini: https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini Jamba-1.5-Large: https://huggingface.co/ai21labs/AI21-Jamba-1.5-Large (...) We have contributed our modified fused_moe kernel to vLLM: https://github.com/vllm-project/vllm/pull/7415 |
| Open Datasets | Yes | We evaluate on the RULER benchmark (Hsieh et al., 2024)... Next we evaluate on ∞BENCH (Zhang et al., 2024)... standard academic benchmarks: MMLU (Hendrycks et al., 2020), MMLU-Pro (Wang et al., 2024), GPQA (Rein et al., 2023), ARC-Challenge (Clark et al., 2018), BBH (Suzgun et al., 2023), and HumanEval (Chen et al., 2021). We also evaluate on the IFEval instruction following dataset (Zhou et al., 2023) and the BFCL v1 function calling dataset (Yan et al., 2024). Finally, we report safety evaluations on RealToxicity (Gehman et al., 2020) and TruthfulQA (Lin et al., 2022). |
| Dataset Splits | Yes | We have made sure to use the same evaluation setup in all models compared in this work, including prompts, official splits, and number of shots. When evaluating Jamba and self-reporting results of other models, we always used the official repositories. We mainly used the LM Evaluation Harness (Biderman et al., 2024) whenever possible. |
| Hardware Specification | Yes | Jamba-1.5-Large was trained on NVIDIA H100 GPUs... Jamba-1.5-Mini ... designed to fit on a single 80GB GPU and Jamba-1.5-Large ... a single 8x80GB GPU machine. (...) All measurements are on 2x A100 80GB GPUs, with batch size 1 and output length 512 tokens. (...) All measurements are on 8x A100 80GB GPUs, with batch size 1 and output length 512 tokens. |
| Software Dependencies | No | We have contributed our modified fused_moe kernel to vLLM: https://github.com/vllm-project/vllm/pull/7415. The paper mentions software like vLLM, FSDP, SentencePiece BPE, and the LM Evaluation Harness, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | Concretely, our implementation of Jamba uses a sequence of Jamba blocks (4 blocks in Jamba-1.5-Mini, 9 in Jamba-1.5-Large). Each block has the following configuration: number of layers (l): 8; total number of experts (n): 16; ratio of attention-to-Mamba layers (a : m): 1 : 7; number of top experts used at each token (K): 2; use MoE instead of MLP every e layers: e = 2. For Jamba-1.5-Large, we used α = 10⁻⁵. The pre-training context length was 4K for Jamba-1.5-Mini and 8K for Jamba-1.5-Large. |
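The block configuration in the Experiment Setup row above can be sketched in code. The following is a minimal, hypothetical Python rendering of one Jamba block's layer layout, assuming 8 layers per block with a 1:7 attention-to-Mamba ratio, MoE replacing the MLP every 2 layers, 16 experts, and top-2 routing; the position of the single attention layer within the block (`attn_positions`) is an illustrative assumption, not something this table specifies.

```python
def jamba_block_layout(n_layers=8, attn_positions=(4,), moe_every=2,
                       n_experts=16, top_k=2):
    """Return a list of (mixer, ffn) descriptors for one Jamba block.

    Each layer pairs a token mixer (one attention layer per block, the
    rest Mamba, giving the 1:7 ratio) with a feed-forward component
    (a sparse MoE every `moe_every` layers, a dense MLP otherwise).
    """
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i in attn_positions else "mamba"
        # Every `moe_every`-th layer swaps the dense MLP for a sparse MoE.
        if (i + 1) % moe_every == 0:
            ffn = f"moe(experts={n_experts}, top_k={top_k})"
        else:
            ffn = "mlp"
        layout.append((mixer, ffn))
    return layout

if __name__ == "__main__":
    for i, (mixer, ffn) in enumerate(jamba_block_layout()):
        print(f"layer {i}: {mixer:9s} + {ffn}")
```

Under these assumptions, Jamba-1.5-Mini would stack 4 such blocks and Jamba-1.5-Large 9, so only a small fraction of layers carry attention's quadratic cost, which is consistent with the throughput and long-context claims in the table.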