Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation or even performance gains compared to the full-context inference baselines. |
| Researcher Affiliation | Collaboration | East China Normal University; Xiamen University; Xiaohongshu Inc.; Nanjing University; Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, China |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the main text. The methodology is described using mathematical equations and textual explanations. |
| Open Source Code | Yes | Code is available at https://github.com/Osilly/dynamic_llava . |
| Open Datasets | Yes | For vision understanding evaluations, we use the commonly used vision understanding benchmarks to evaluate the performance similar as LLaVA-1.5, such as VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), VizWiz (Gurari et al., 2018), SciQA (Lu et al., 2022), TextVQA (Singh et al., 2019), POPE (Li et al., 2023b), MMBench (en) (Liu et al., 2023b), SEED (image) (Li et al., 2023a) and MM-Vet (Yu et al., 2023). Furthermore, we also use the vision-centric vision understanding benchmarks, such as MMVP (Tong et al., 2024b), RealWorldQA (xAI, 2024) and CVBench-2D (Tong et al., 2024a). |
| Dataset Splits | No | No explicit train/validation/test dataset splits are provided in the paper. The paper mentions using "commonly used vision understanding benchmarks" and constructing specific evaluation subsets (e.g., "selecting 1,000 instances from the LVISInstruct4V dataset"), but does not give explicit percentages or counts for training, validation, and test splits in its own experimental setup. |
| Hardware Specification | Yes | All of the methods are trained on 8 NVIDIA A100 (80G) GPUs using PyTorch (Paszke et al., 2019). The results are measured on one A100 (80G) and the batch size is fixed to 8. |
| Software Dependencies | No | The paper mentions using PyTorch (Paszke et al., 2019) but does not provide a specific version number for it or any other software dependencies. |
| Experiment Setup | Yes | The hyper-parameters, namely the decoder layer at which tokens are sparsified (l), the minimum output text token length a training sample must have (LEN_OT), and the weight of the regularization term (λ), are set to 2, 50 and 100 in all experiments, respectively. During training, we freeze the vision encoder and projector, updating only the parameters of the LLM and predictors. The initial learning rates for the LLM and predictors are set at 5e-6 and 2e-4, respectively, with a fixed global batch size of 64. |
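The training hyper-parameters reported above can be collected into a small configuration sketch. This is an illustrative consolidation only; the field names and dataclass are assumptions, not taken from the released Dynamic-LLaVA code:

```python
# Hypothetical training config mirroring the hyper-parameters reported in the
# paper; all names here are illustrative assumptions, not the authors' API.
from dataclasses import dataclass


@dataclass(frozen=True)
class DynamicLLaVATrainConfig:
    sparsify_layer: int = 2          # l: decoder layer at which tokens are sparsified
    min_output_len: int = 50         # LEN_OT: minimum output text token length per sample
    reg_weight: float = 100.0        # λ: weight of the regularization term
    lr_llm: float = 5e-6             # initial learning rate for the LLM
    lr_predictor: float = 2e-4       # initial learning rate for the predictors
    global_batch_size: int = 64
    freeze_vision_encoder: bool = True  # vision encoder is frozen during training
    freeze_projector: bool = True       # projector is frozen during training


cfg = DynamicLLaVATrainConfig()
print(cfg.sparsify_layer, cfg.min_output_len, cfg.reg_weight)
```

A frozen dataclass keeps the reported settings in one immutable place, which is convenient when checking that every experiment in a reproduction run uses the same values.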