EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Authors: Yifei Xing, Xiangyuan Lan, Ruiping Wang, Dongmei Jiang, Wenjun Huang, Zheng Qingfang, Yaowei Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. |
| Researcher Affiliation | Academia | 1 Institute of Computing Technology, Chinese Academy of Sciences; 2 Pengcheng Laboratory; 3 University of Chinese Academy of Sciences; 4 Sun Yat-sen University; 5 Pazhou Laboratory (Huangpu) |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations (Equations 1-9), but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code provided at https://github.com/xingyifei2016/EMMA. |
| Open Datasets | Yes | Following Zhao et al. (2024a), we train EMMA on a combination of datasets consisting of LLaVA-v1.5-mixed-665k Liu et al. (2024b), LVIS-Instruct-4V Wang et al. (2023), and LRV-Instruct Liu et al. (2023b). |
| Dataset Splits | Yes | We evaluate our model variants on four open-ended visual question-answer benchmarks: VQAv2 Goyal et al. (2017) and VizWiz Gurari et al. (2018) test general visual reasoning, GQA Hudson & Manning (2019) validates spatial reasoning, and TextVQA Singh et al. (2019) assesses reasoning around text. We also evaluate our models on nine comprehensive closed-set benchmarks: VSR Liu et al. (2023a). |
| Hardware Specification | Yes | Our models are trained on eight 40G A100 GPUs with fully sharded data parallelism Zhao et al. (2023b). All evaluations are conducted on a single 40G A100 GPU. |
| Software Dependencies | No | The paper mentions using a "GPTNeoXTokenizerFast tokenizer" and "SigLIP and DINOv2 models" but does not specify version numbers for Python, PyTorch, CUDA, or any other general software libraries. |
| Experiment Setup | Yes | We directly finetune the Mamba LLM backbone, the multi-scale fusion module, the image decoder, and the MLP projector on the training data for two epochs, discarding the pretrain phase. The visual encoder is frozen at all times. We select a global batch size of 128 and a starting learning rate of 2e-5 with AdamW optimization. Our models are trained on eight 40G A100 GPUs with fully sharded data parallelism Zhao et al. (2023b). |
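For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup and Hardware rows can be collected into a single configuration object. This is a minimal sketch: the dataclass and its field names are my own invention, not part of the EMMA codebase, and it assumes no gradient accumulation (the paper does not state any).

```python
# Hedged sketch of the reported finetuning recipe; field names are
# hypothetical, not taken from https://github.com/xingyifei2016/EMMA.
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    epochs: int = 2                      # finetune for two epochs, no pretrain phase
    global_batch_size: int = 128         # reported global batch size
    num_gpus: int = 8                    # eight 40G A100 GPUs
    learning_rate: float = 2e-5          # starting learning rate, AdamW
    optimizer: str = "AdamW"
    freeze_vision_encoder: bool = True   # "frozen at all times"
    parallelism: str = "FSDP"            # fully sharded data parallelism

    @property
    def per_device_batch_size(self) -> int:
        # Assumes no gradient accumulation, which the paper does not mention.
        return self.global_batch_size // self.num_gpus


cfg = FinetuneConfig()
print(cfg.per_device_batch_size)  # 16 samples per GPU under the assumption above
```

A config object like this makes it easy to log the exact settings alongside results, which is useful when checking a reproduction against the numbers reported in the paper.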