$\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Authors: Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our method, we apply it to three popular MLLMs and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of γ-MoD to existing MLLMs but also confirm its generalization ability on various MLLMs.
Researcher Affiliation | Academia | Yaxin Luo1, Gen Luo2, Jiayi Ji3,4, Yiyi Zhou3, Xiaoshuai Sun3, Zhiqiang Shen1, Rongrong Ji3 — 1MBZUAI, 2OpenGVLab, Shanghai AI Laboratory, 3Xiamen University, 4National University of Singapore
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; it describes the methodology in text and mathematical equations.
Open Source Code | No | The paper mentions a "Project Page: Gamma-MOD" but provides neither a direct link to a code repository nor an explicit statement that the source code for the described methodology will be released.
Open Datasets | Yes | We evaluate our γ-MoD on five MLLM benchmarks, which includes POPE (Li et al., 2023), MME (Fu et al., 2024), MMB (Liu et al., 2024e), MMMU (Yue et al., 2024) and MM-Vet (Yu et al., 2023). We report all the results in their default settings. In addition, we evaluate γ-MoD on six image question answering benchmarks: VQAv2 (Goyal et al., 2017), VizWiz (Gurari et al., 2018), TextVQA (Singh et al., 2019), SQA (Lu et al., 2022), GQA (Hudson & Manning, 2019) and SEED (Ge et al., 2023). ... For all models, pre-training is conducted on the LCS-558K dataset (Liu et al., 2024b), which includes 558k high-quality image-text pairs. For instruction tuning, we follow LLaVA-1.5 (Liu et al., 2024b) to use 665k vision-language instruction data.
Dataset Splits | Yes | We report all the results in their default settings. ... For all models, pre-training is conducted on the LCS-558K dataset (Liu et al., 2024b), which includes 558k high-quality image-text pairs. For instruction tuning, we follow LLaVA-1.5 (Liu et al., 2024b) to use 665k vision-language instruction data.
Hardware Specification | Yes | The inference efficiency is tested on an NVIDIA A100 GPU, reported as the average over GQA, SQA, MMMU, and TextVQA.
Software Dependencies | No | The paper mentions using existing MLLMs such as LLaVA-HR and LLaVA, but it does not specify version numbers for any ancillary software dependencies (e.g., Python, PyTorch, CUDA versions) used in the implementation. It only states that "The remaining settings are kept the same with LLaVA-HR (Luo et al., 2024) and LLaVA (Liu et al., 2024b), including learning rate, training epochs, optimizer and datasets, etc.", which defers to other papers without providing specific versions.
Experiment Setup | Yes | For all models, the fourth largest ARank value is used as the threshold for converting dense layers to MoD ones. During instruction tuning, the coefficient for the routing loss is set to 0.01. The remaining settings are kept the same with LLaVA-HR (Luo et al., 2024) and LLaVA (Liu et al., 2024b), including learning rate, training epochs, optimizer and datasets, etc.
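To make the quoted setup concrete, the Mixture-of-Depth routing it describes (a per-token router that lets only some tokens pass through a layer, plus an auxiliary routing loss weighted by 0.01) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the sigmoid router, the 50% keep ratio, and the MSE-style auxiliary loss are all assumptions introduced here.

```python
import numpy as np

def mod_route(hidden, router_w, keep_ratio=0.5):
    """Score each token with a linear+sigmoid router (assumed form);
    only the top-k scoring tokens are routed through the MoD layer."""
    scores = 1.0 / (1.0 + np.exp(-(hidden @ router_w)))   # (seq_len,)
    k = max(1, int(round(len(scores) * keep_ratio)))
    keep = np.argsort(scores)[-k:]                        # routed token indices
    return keep, scores

def routing_loss(scores, keep, coeff=0.01):
    """Auxiliary loss (illustrative MSE variant) nudging kept-token scores
    toward 1 and skipped-token scores toward 0, scaled by the paper's
    reported coefficient of 0.01."""
    target = np.zeros_like(scores)
    target[keep] = 1.0
    return coeff * float(np.mean((scores - target) ** 2))

# Toy usage: 8 tokens with hidden dimension 4.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
w = rng.normal(size=4)
keep, scores = mod_route(tokens, w, keep_ratio=0.5)
loss = routing_loss(scores, keep)
```

With a keep ratio of 0.5, half of the tokens skip the layer entirely, which is where the efficiency gain over a dense layer comes from; the small routing-loss weight keeps the auxiliary objective from dominating the instruction-tuning loss.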