$\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Authors: Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our method, we apply it to three popular MLLMs and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of γ-MoD to existing MLLMs but also confirm its generalization ability on various MLLMs.
Researcher Affiliation | Academia | Yaxin Luo1, Gen Luo2, Jiayi Ji3,4, Yiyi Zhou3, Xiaoshuai Sun3, Zhiqiang Shen1, Rongrong Ji3 — 1MBZUAI, 2OpenGVLab, Shanghai AI Laboratory, 3Xiamen University, 4National University of Singapore
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; it describes the methodology in text and mathematical equations.
Open Source Code | No | The paper mentions a "Project Page: Gamma-MOD" but provides neither a direct link to a code repository nor an explicit statement that the source code for the described methodology will be released.
Open Datasets | Yes | We evaluate our γ-MoD on five MLLM benchmarks, which includes POPE (Li et al., 2023), MME (Fu et al., 2024), MMB (Liu et al., 2024e), MMMU (Yue et al., 2024) and MM-Vet (Yu et al., 2023). We report all the results in their default settings. In addition, we evaluate γ-MoD on six image question answering benchmarks: VQAv2 (Goyal et al., 2017), VizWiz (Gurari et al., 2018), TextVQA (Singh et al., 2019), SQA (Lu et al., 2022), GQA (Hudson & Manning, 2019) and SEED (Ge et al., 2023). ... For all models, pre-training is conducted on the LCS-558K dataset (Liu et al., 2024b), which includes 558k high-quality image-text pairs. For instruction tuning, we follow LLaVA-1.5 (Liu et al., 2024b) to use 665k vision-language instruction data.
Dataset Splits | Yes | We report all the results in their default settings. ... For all models, pre-training is conducted on the LCS-558K dataset (Liu et al., 2024b), which includes 558k high-quality image-text pairs. For instruction tuning, we follow LLaVA-1.5 (Liu et al., 2024b) to use 665k vision-language instruction data.
Hardware Specification | Yes | The inference efficiency is tested on an NVIDIA A100 GPU, reported as the average over GQA, SQA, MMMU, and TextVQA.
Software Dependencies | No | The paper mentions using existing MLLMs such as LLaVA-HR and LLaVA, but it does not specify version numbers for any ancillary software dependencies (e.g., Python, PyTorch, CUDA versions) used in the implementation. It only states that "The remaining settings are kept the same with LLaVA-HR (Luo et al., 2024) and LLaVA (Liu et al., 2024b), including learning rate, training epochs, optimizer and datasets, etc.", which defers to other papers without providing specific versions.
Experiment Setup | Yes | For all models, the fourth largest ARank value is used as the threshold for converting dense layers to MoD ones. During instruction tuning, the coefficient for the routing loss is set to 0.01. The remaining settings are kept the same with LLaVA-HR (Luo et al., 2024) and LLaVA (Liu et al., 2024b), including learning rate, training epochs, optimizer and datasets, etc.
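To make the quoted setup concrete, the Mixture-of-Depth routing it describes (a per-token router that lets only some tokens pass through a layer, plus an auxiliary routing loss weighted by 0.01) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the sigmoid router, the 50% keep ratio, and the MSE-style auxiliary loss are all assumptions introduced here.

```python
import numpy as np

def mod_route(hidden, router_w, keep_ratio=0.5):
    """Score each token with a linear+sigmoid router (assumed form);
    only the top-k scoring tokens are routed through the MoD layer."""
    scores = 1.0 / (1.0 + np.exp(-(hidden @ router_w)))   # (seq_len,)
    k = max(1, int(round(len(scores) * keep_ratio)))
    keep = np.argsort(scores)[-k:]                        # routed token indices
    return keep, scores

def routing_loss(scores, keep, coeff=0.01):
    """Auxiliary loss (illustrative MSE variant) nudging kept-token scores
    toward 1 and skipped-token scores toward 0, scaled by the paper's
    reported coefficient of 0.01."""
    target = np.zeros_like(scores)
    target[keep] = 1.0
    return coeff * float(np.mean((scores - target) ** 2))

# Toy usage: 8 tokens with hidden dimension 4.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
w = rng.normal(size=4)
keep, scores = mod_route(tokens, w, keep_ratio=0.5)
loss = routing_loss(scores, keep)
```

With a keep ratio of 0.5, half of the tokens skip the layer entirely, which is where the efficiency gain over a dense layer comes from; the small routing-loss weight keeps the auxiliary objective from dominating the instruction-tuning loss.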