$\gamma$-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Authors: Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our method, we apply it to three popular MLLMs, and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of γ-MoD to existing MLLMs but also confirm its generalization ability on various MLLMs. |
| Researcher Affiliation | Academia | Yaxin Luo1, Gen Luo2, Jiayi Ji3,4, Yiyi Zhou3, Xiaoshuai Sun3, Zhiqiang Shen1, Rongrong Ji3 1MBZUAI 2OpenGVLab, Shanghai AI Laboratory 3Xiamen University 4National University of Singapore |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes the methodology in text and mathematical equations. |
| Open Source Code | No | The paper mentions "Project Page: Gamma-MOD" but does not provide a direct link to a code repository or an explicit statement about releasing the source code for the described methodology. |
| Open Datasets | Yes | We evaluate our γ-MoD on five MLLM benchmarks, which includes POPE (Li et al., 2023), MME (Fu et al., 2024), MMB (Liu et al., 2024e), MMMU (Yue et al., 2024) and MM-Vet (Yu et al., 2023). We report all the results in their default settings. In addition, we evaluate γ-MoD on six image question answering benchmarks: VQAv2 (Goyal et al., 2017), VizWiz (Gurari et al., 2018), TextVQA (Singh et al., 2019), SQA (Lu et al., 2022), GQA (Hudson & Manning, 2019) and SEED (Ge et al., 2023). ... For all models, pre-training is conducted on the LCS-558K dataset (Liu et al., 2024b), which includes 558k high-quality image-text pairs. For instruction tuning, we follow LLaVA-1.5 (Liu et al., 2024b) to use 665k vision-language instruction data. |
| Dataset Splits | Yes | We report all the results in their default settings. ... For all models, pre-training is conducted on the LCS-558K dataset (Liu et al., 2024b), which includes 558k high-quality image-text pairs. For instruction tuning, we follow LLaVA-1.5 (Liu et al., 2024b) to use 665k vision-language instruction data. |
| Hardware Specification | Yes | The inference efficiency is tested on an NVIDIA A100 GPU, which is the average value of GQA, SQA, MMMU, and TextVQA. |
| Software Dependencies | No | The paper mentions using existing MLLMs such as LLaVA-HR and LLaVA, but it does not specify version numbers for any ancillary software dependencies (e.g., Python, PyTorch, or CUDA versions) used in the implementation. It only states that "The remaining settings are kept the same with LLaVA-HR (Luo et al., 2024) and LLaVA (Liu et al., 2024b), including learning rate, training epochs, optimizer and datasets, etc.", which defers to other papers without providing specific versions. |
| Experiment Setup | Yes | For all models, the fourth largest ARank value is used as the threshold for converting dense layers to MoD ones. During instruction tuning, the coefficient for the routing loss is set to 0.01. The remaining settings are kept the same with LLaVA-HR (Luo et al., 2024) and LLaVA (Liu et al., 2024b), including learning rate, training epochs, optimizer and datasets, etc. |
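The layer-conversion rule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes per-layer ARank values are already computed, and it assumes (based on the paper's premise that low-rank attention signals redundancy) that layers whose ARank falls *below* the fourth-largest value are the ones converted to MoD. The function name `select_mod_layers` is hypothetical.

```python
def select_mod_layers(aranks, k=4):
    """Return indices of layers to convert from dense to MoD.

    aranks: list of per-layer ARank values (one float per layer).
    k: the k-th largest ARank is used as the threshold (k=4 per the paper).
    """
    # Threshold = k-th largest ARank across all layers.
    threshold = sorted(aranks, reverse=True)[k - 1]
    # Assumed direction: layers with ARank below the threshold are
    # treated as redundant and converted to MoD layers.
    return [i for i, r in enumerate(aranks) if r < threshold]
```

For example, with `aranks = [10, 9, 8, 7, 6, 5]` the threshold is 7 (the fourth-largest value), so layers 4 and 5 would be converted while the first four remain dense.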