Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation, or even performance gains, compared to the full-context inference baselines.
Researcher Affiliation Collaboration ¹East China Normal University, ²Xiamen University, ³Xiaohongshu Inc., ⁴Nanjing University, ⁵Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, China
Pseudocode No No explicit pseudocode or algorithm blocks are provided in the main text. The methodology is described using mathematical equations and textual explanations.
Open Source Code Yes Code is available at https://github.com/Osilly/dynamic_llava
Open Datasets Yes For vision understanding evaluations, we use the commonly used vision understanding benchmarks to evaluate the performance, similar to LLaVA-1.5, such as VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), VizWiz (Gurari et al., 2018), SciQA (Lu et al., 2022), TextVQA (Singh et al., 2019), POPE (Li et al., 2023b), MMBench (en) (Liu et al., 2023b), SEED (image) (Li et al., 2023a) and MM-Vet (Yu et al., 2023). Furthermore, we also use the vision-centric vision understanding benchmarks, such as MMVP (Tong et al., 2024b), RealWorldQA (xAI, 2024) and CVBench-2D (Tong et al., 2024a).
Dataset Splits No No explicit train/validation/test dataset splits are provided in the paper. The paper mentions using "commonly used vision understanding benchmarks" and constructing specific subsets for evaluation (e.g., "selecting 1,000 instances from the LVISInstruct4V dataset"), but does not detail explicit percentages or counts for training, validation, and test splits within its own experimental setup.
Hardware Specification Yes All of the methods are trained on 8 NVIDIA A100 (80G) GPUs using PyTorch (Paszke et al., 2019). The results are measured on one A100 (80G) and the batch size is fixed to 8.
Software Dependencies No The paper mentions using PyTorch (Paszke et al., 2019) but does not provide a specific version number for it or for any other software dependencies.
Experiment Setup Yes The hyper-parameters for which decoder layer to sparsify the tokens (l), the minimum output text token length a sample must have to be used for training (LEN_OT), and the weight of the regularization term (λ) are set to 2, 50 and 100 in all experiments, respectively. During training, we freeze the vision encoder and projector, updating only the parameters of the LLM and the predictors. The initial learning rates for the LLM and the predictors are set to 5e-6 and 2e-4, respectively, with a fixed global batch size of 64.
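The reported setup can be collected into a small configuration sketch. This is only an illustration of the values quoted above: all key and variable names are hypothetical, and only the numeric values come from the paper's stated setup.

```python
# Hypothetical configuration sketch for the reported Dynamic-LLaVA setup.
# Key names are invented for illustration; values are as quoted in the paper.

sparsify_config = {
    "sparsify_decoder_layer": 2,    # l: decoder layer at which tokens are sparsified
    "min_output_token_len": 50,     # LEN_OT: minimum output text token length per training sample
    "regularization_weight": 100,   # lambda: weight of the regularization term
}

train_config = {
    # Vision encoder and projector are frozen; only LLM and predictors update.
    "frozen_modules": ["vision_encoder", "projector"],
    "trainable_modules": ["llm", "predictors"],
    "lr_llm": 5e-6,                 # initial learning rate for the LLM
    "lr_predictors": 2e-4,          # initial learning rate for the predictors
    "global_batch_size": 64,
}

if __name__ == "__main__":
    print(sparsify_config["sparsify_decoder_layer"])  # 2
    print(train_config["global_batch_size"])          # 64
```

In a PyTorch training script, the two learning rates would typically be realized as separate optimizer parameter groups, one for the LLM parameters and one for the predictor parameters.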