Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
Authors: Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across various multi-modal benchmarks demonstrate that: (i) Cobra performs 3–4× faster than the most computationally efficient state-of-the-art methods, e.g., LLaVA-Phi and MobileVLM v2, and its performance is significantly enhanced thanks to linear sequence modeling. (ii) Cobra fine-tunes only a small fraction of parameters (∼48% of model parameters), leading to a significant improvement in overall performance compared to LLaVA. |
| Researcher Affiliation | Collaboration | Han Zhao (1,2)*, Min Zhang (1)*, Wei Zhao (2), Pengxiang Ding (2), Siteng Huang (3), Donglin Wang (2). Affiliations: 1 Zhejiang University; 2 Westlake University; 3 DAMO Academy, Alibaba Group |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/h-zhao1997/cobra |
| Open Datasets | Yes | We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016). |
| Dataset Splits | Yes | We conduct our experiments on a diverse set of nine benchmarks, including (1) four open-ended visual question answering (VQA) benchmarks, i.e., VQA-v2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018) and TextVQA (Singh et al. 2019); (2) two closed-set VQA benchmarks, i.e., VSR (Liu, Emerson, and Collier 2023) and POPE (Li et al. 2023b); (3) three visual grounding benchmarks, i.e., RefCOCO, RefCOCO+ and RefCOCOg (Kazemzadeh et al. 2014; Yu et al. 2016). |
| Hardware Specification | Yes | The model is trained using 8 NVIDIA A100 80GB GPUs. We used the same question "Describe the image specifically" as the textual prompt and set the number of output tokens to 256 for all models. The total time T_total, from image encoding to the completion of the generated answer, is recorded, and the average number of tokens generated per second is computed as Eval_avg = 256 / T_total. All evaluations were done on a single NVIDIA A100 PCIe 80GB GPU. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | Table 1 (configuration of models and hyperparameters): Vision encoder: DINOv2 + SigLIP ViT-SO; LLM init.: Mamba-2.8b-Zephyr / Mamba-7B; Projector init.: random; Image resolution: 384 × 384; Image token num.: 729; Global batch size: 128; Training steps: 19K; Optimizer: AdamW; LR schedule: cosine decay; Learning rate: 2e-5; Weight decay: 0.1; Warm-up ratio: 0.03; Number of epochs: 2 |
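The throughput metric reported under Hardware Specification (Eval_avg = 256 / T_total, timed from image encoding through the last generated token) can be sketched in a few lines. `generate_fn` here is a hypothetical stand-in for the model's end-to-end generation call, not an API from the Cobra codebase:

```python
import time

def measure_decode_speed(generate_fn, prompt, num_output_tokens=256):
    """Average generated tokens per second, following the report's
    Eval_avg = 256 / T_total, where T_total covers everything from
    image encoding to finishing the complete answer.

    `generate_fn` is a hypothetical callable wrapping the model's
    full image-to-text pipeline (assumption, not the paper's API).
    """
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens=num_output_tokens)
    t_total = time.perf_counter() - start
    return num_output_tokens / t_total
```

Fixing the output length at 256 tokens for every model, as the paper does, makes the per-second figure comparable across architectures regardless of how verbose each model's answer would naturally be.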
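The learning-rate settings in Table 1 (base LR 2e-5, cosine decay, warm-up ratio 0.03, 19K steps) imply a schedule like the following minimal sketch. The linear-warmup shape and decay-to-zero endpoint are assumptions; the paper states only "Cosine decay" with a 0.03 warm-up ratio:

```python
import math

# Hyperparameters taken from Table 1 of the paper.
BASE_LR = 2e-5
TOTAL_STEPS = 19_000
WARMUP_RATIO = 0.03

def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup over
    the first 3% of steps, then cosine decay toward zero."""
    warmup_steps = int(TOTAL_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        # Linear ramp from ~0 up to the base learning rate.
        return BASE_LR * (step + 1) / warmup_steps
    # Cosine decay from BASE_LR down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

With these numbers the warmup phase lasts 570 steps, after which the rate decays smoothly to zero at step 19,000.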