EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Authors: Yifei Xing, Xiangyuan Lan, Ruiping Wang, Dongmei Jiang, Wenjun Huang, Zheng Qingfang, Yaowei Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. |
| Researcher Affiliation | Academia | 1 Institute of Computing Technology, Chinese Academy of Sciences; 2 Pengcheng Laboratory; 3 University of Chinese Academy of Sciences; 4 Sun Yat-sen University; 5 Pazhou Laboratory (Huangpu) |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical formulations (Equations 1-9), but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code provided at https://github.com/xingyifei2016/EMMA. |
| Open Datasets | Yes | Following Zhao et al. (2024a), we train EMMA on a combination of datasets consisting of LLaVA-v1.5-mixed-665k Liu et al. (2024b), LVIS-Instruct-4V Wang et al. (2023), and LRV-Instruct Liu et al. (2023b). |
| Dataset Splits | Yes | We evaluate our model variants on four open-ended visual question-answer benchmarks: VQAv2 Goyal et al. (2017) and VizWiz Gurari et al. (2018) test general visual reasoning, GQA Hudson & Manning (2019) validates spatial reasoning, and TextVQA Singh et al. (2019) assesses reasoning around text. We also evaluate our models on nine comprehensive closed-set benchmarks: VSR Liu et al. (2023a). |
| Hardware Specification | Yes | Our models are trained on eight 40G A100 GPUs with fully sharded data parallelism Zhao et al. (2023b). All evaluations are conducted on a single 40G A100 GPU. |
| Software Dependencies | No | The paper mentions using a "GPTNeoXTokenizerFast tokenizer" and "SigLIP and DINOv2 models" but does not specify version numbers for Python, PyTorch, CUDA, or any other general software libraries. |
| Experiment Setup | Yes | We directly finetune the Mamba LLM backbone, the multi-scale fusion module, the image decoder, and the MLP projector on the training data for two epochs, discarding the pretrain phase. The visual encoder is frozen at all times. We select a global batch size of 128 and a starting learning rate of 2e-5 with AdamW optimization. Our models are trained on eight 40G A100 GPUs with fully sharded data parallelism Zhao et al. (2023b). |
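For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup and Hardware rows can be collected into a single configuration object. This is a minimal sketch: the dataclass and its field names are my own invention, not part of the EMMA codebase, and it assumes no gradient accumulation (the paper does not state any).

```python
# Hedged sketch of the reported finetuning recipe; field names are
# hypothetical, not taken from https://github.com/xingyifei2016/EMMA.
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    epochs: int = 2                      # finetune for two epochs, no pretrain phase
    global_batch_size: int = 128         # reported global batch size
    num_gpus: int = 8                    # eight 40G A100 GPUs
    learning_rate: float = 2e-5          # starting learning rate, AdamW
    optimizer: str = "AdamW"
    freeze_vision_encoder: bool = True   # "frozen at all times"
    parallelism: str = "FSDP"            # fully sharded data parallelism

    @property
    def per_device_batch_size(self) -> int:
        # Assumes no gradient accumulation, which the paper does not mention.
        return self.global_batch_size // self.num_gpus


cfg = FinetuneConfig()
print(cfg.per_device_batch_size)  # 16 samples per GPU under the assumption above
```

A config object like this makes it easy to log the exact settings alongside results, which is useful when checking a reproduction against the numbers reported in the paper.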