Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3–4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2. Additionally, its performance is significantly enhanced thanks to the implementation of linear sequential modeling. (ii) Cobra fine-tunes a small set of parameters (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA.
Researcher Affiliation | Collaboration | Han Zhao1,2*, Min Zhang1*, Wei Zhao2, Pengxiang Ding2, Siteng Huang3, Donglin Wang2; 1Zhejiang University, 2Westlake University, 3DAMO Academy, Alibaba Group; EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/h-zhao1997/cobra
Open Datasets | Yes | We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016).
Dataset Splits | Yes | We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016).
Hardware Specification | Yes | The model is trained using 8 NVIDIA A100 80GB GPUs. We used the same question "Describe the image specifically" as the textual prompt and set the number of output tokens to 256 for all models. The total time T_total from image encoding to finishing the complete answer is recorded, and the average number of tokens generated per second is calculated as Eval_avg = 256 / T_total. All evaluations were done on a single NVIDIA A100 PCIe 80GB GPU.
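The speed metric above is a simple ratio and can be sketched in a few lines. This is an illustrative harness, not code from the Cobra repository; the function name and example timing are our own.

```python
# Sketch of the paper's generation-speed metric: Eval_avg = 256 / T_total,
# where T_total spans image encoding through the last generated token.
NUM_OUTPUT_TOKENS = 256  # fixed output length used for all models in the comparison

def eval_avg(t_total_seconds: float, num_tokens: int = NUM_OUTPUT_TOKENS) -> float:
    """Average number of tokens generated per second over the full run."""
    return num_tokens / t_total_seconds

# Example: an end-to-end run of 1.6 s corresponds to 160 tokens/s.
print(eval_avg(1.6))  # 160.0
```

In practice T_total would be measured with a wall-clock timer around the whole encode-and-generate call, so the metric penalizes slow vision encoding as well as slow decoding.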
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | Table 1: The configuration of models and hyperparameters.
Vision Encoder: DINOv2 + SigLIP ViT-SO
LLM init.: Mamba-2.8b-Zephyr / Mamba-7B
Projector init.: Random
Image resolution: 384 × 384
Image token num.: 729
Global batch size: 128
Training steps: 19K
Optimizer: AdamW
LR schedule: Cosine decay
Learning rate: 2e-5
Weight decay: 0.1
Warm-up ratio: 0.03
Number of epochs: 2
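For reproduction, Table 1 maps naturally onto a training-config object. The sketch below is a hedged rendering under our own key names; the Cobra repository may organize its configuration differently.

```python
# Table 1's hyperparameters as a plain config dict (key names are ours,
# not necessarily those used in the Cobra codebase).
train_config = {
    "vision_encoder": "DINOv2 + SigLIP ViT-SO",
    "llm_init": "Mamba-2.8b-Zephyr",   # or "Mamba-7B"
    "projector_init": "random",
    "image_resolution": (384, 384),
    "num_image_tokens": 729,
    "global_batch_size": 128,
    "training_steps": 19_000,
    "optimizer": "AdamW",
    "lr_schedule": "cosine_decay",
    "learning_rate": 2e-5,
    "weight_decay": 0.1,
    "warmup_ratio": 0.03,
    "num_epochs": 2,
}

# Sanity check: 19K steps at a global batch of 128 covers ~2.43M samples.
samples_seen = train_config["training_steps"] * train_config["global_batch_size"]
print(samples_seen)  # 2432000
```

The samples-seen arithmetic is a quick consistency check one can run against the stated dataset sizes and the 2-epoch budget.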