Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3–4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2. Additionally, its performance is significantly enhanced thanks to the implementation of linear sequential modeling. (ii) Cobra fine-tunes a small set of parameters (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.
Researcher Affiliation | Collaboration | Han Zhao1,2*, Min Zhang1*, Wei Zhao2, Pengxiang Ding2, Siteng Huang3, Donglin Wang2; 1Zhejiang University, 2Westlake University, 3DAMO Academy, Alibaba Group; EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/h-zhao1997/cobra
Open Datasets | Yes | We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016).
Dataset Splits | Yes | We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016).
Hardware Specification | Yes | The model is trained using 8 NVIDIA A100 80GB GPUs. We used the same question "Describe the image specifically" as the textual prompt and set the number of output tokens to 256 for all models. The total time T_total from image encoding to finishing the complete answer is recorded, and the average number of tokens generated per second is calculated as Eval_avg = 256 / T_total. All evaluations were done on a single NVIDIA A100 PCIe 80GB GPU.
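The speed metric above is a simple ratio and can be sketched in a few lines. This is an illustrative harness, not code from the Cobra repository; the function name and example timing are our own.

```python
# Sketch of the paper's generation-speed metric: Eval_avg = 256 / T_total,
# where T_total spans image encoding through the last generated token.
NUM_OUTPUT_TOKENS = 256  # fixed output length used for all models in the comparison

def eval_avg(t_total_seconds: float, num_tokens: int = NUM_OUTPUT_TOKENS) -> float:
    """Average number of tokens generated per second over the full run."""
    return num_tokens / t_total_seconds

# Example: an end-to-end run of 1.6 s corresponds to 160 tokens/s.
print(eval_avg(1.6))  # 160.0
```

In practice T_total would be measured with a wall-clock timer around the whole encode-and-generate call, so the metric penalizes slow vision encoding as well as slow decoding.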
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | Table 1: The configuration of models and hyperparameters.
Vision Encoder: DINOv2 + SigLIP ViT-SO
LLM init.: Mamba-2.8b-Zephyr / Mamba-7B
Projector init.: Random
Image resolution: 384 × 384
Image token num.: 729
Global batch size: 128
Training steps: 19K
Optimizer: AdamW
LR schedule: Cosine decay
Learning rate: 2e-5
Weight decay: 0.1
Warm-up ratio: 0.03
Number of epochs: 2
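For reproduction, Table 1 maps naturally onto a training-config object. The sketch below is a hedged rendering under our own key names; the Cobra repository may organize its configuration differently.

```python
# Table 1's hyperparameters as a plain config dict (key names are ours,
# not necessarily those used in the Cobra codebase).
train_config = {
    "vision_encoder": "DINOv2 + SigLIP ViT-SO",
    "llm_init": "Mamba-2.8b-Zephyr",   # or "Mamba-7B"
    "projector_init": "random",
    "image_resolution": (384, 384),
    "num_image_tokens": 729,
    "global_batch_size": 128,
    "training_steps": 19_000,
    "optimizer": "AdamW",
    "lr_schedule": "cosine_decay",
    "learning_rate": 2e-5,
    "weight_decay": 0.1,
    "warmup_ratio": 0.03,
    "num_epochs": 2,
}

# Sanity check: 19K steps at a global batch of 128 covers ~2.43M samples.
samples_seen = train_config["training_steps"] * train_config["global_batch_size"]
print(samples_seen)  # 2432000
```

The samples-seen arithmetic is a quick consistency check one can run against the stated dataset sizes and the 2-epoch budget.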