Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation, or even performance gains, compared to the full-context inference baselines.
Researcher Affiliation Collaboration ¹East China Normal University, ²Xiamen University, ³Xiaohongshu Inc., ⁴Nanjing University, ⁵Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, China
Pseudocode No No explicit pseudocode or algorithm blocks are provided in the main text. The methodology is described using mathematical equations and textual explanations.
Open Source Code Yes Code is available at https://github.com/Osilly/dynamic_llava
Open Datasets Yes For vision understanding evaluations, we use the commonly used vision understanding benchmarks to evaluate the performance, similar to LLaVA-1.5, such as VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), VizWiz (Gurari et al., 2018), SciQA (Lu et al., 2022), TextVQA (Singh et al., 2019), POPE (Li et al., 2023b), MMBench (en) (Liu et al., 2023b), SEED (image) (Li et al., 2023a) and MM-Vet (Yu et al., 2023). Furthermore, we also use the vision-centric vision understanding benchmarks, such as MMVP (Tong et al., 2024b), RealWorldQA (xAI, 2024) and CVBench-2D (Tong et al., 2024a).
Dataset Splits No No explicit train/validation/test dataset splits are provided in the paper. The paper mentions using "commonly used vision understanding benchmarks" and constructing specific subsets for evaluation (e.g., "selecting 1,000 instances from the LVISInstruct4V dataset"), but does not detail explicit percentages or counts for training, validation, and test splits within its own experimental setup.
Hardware Specification Yes All of the methods are trained on 8 NVIDIA A100 (80G) GPUs using PyTorch (Paszke et al., 2019). The results are measured on one A100 (80G) and the batch size is fixed to 8.
Software Dependencies No The paper mentions using PyTorch (Paszke et al., 2019) but does not provide a specific version number for it or for any other software dependencies.
Experiment Setup Yes The hyper-parameters for which decoder layer to sparsify the tokens (l), the minimum output text token length a sample must have to be used for training (LEN_OT), and the weight of the regularization term (λ) are set to 2, 50 and 100 in all experiments, respectively. During training, we freeze the vision encoder and projector, updating only the parameters of the LLM and the predictors. The initial learning rates for the LLM and the predictors are set to 5e-6 and 2e-4, respectively, with a fixed global batch size of 64.
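The reported setup can be collected into a small configuration sketch. This is only an illustration of the values quoted above: all key and variable names are hypothetical, and only the numeric values come from the paper's stated setup.

```python
# Hypothetical configuration sketch for the reported Dynamic-LLaVA setup.
# Key names are invented for illustration; values are as quoted in the paper.

sparsify_config = {
    "sparsify_decoder_layer": 2,    # l: decoder layer at which tokens are sparsified
    "min_output_token_len": 50,     # LEN_OT: minimum output text token length per training sample
    "regularization_weight": 100,   # lambda: weight of the regularization term
}

train_config = {
    # Vision encoder and projector are frozen; only LLM and predictors update.
    "frozen_modules": ["vision_encoder", "projector"],
    "trainable_modules": ["llm", "predictors"],
    "lr_llm": 5e-6,                 # initial learning rate for the LLM
    "lr_predictors": 2e-4,          # initial learning rate for the predictors
    "global_batch_size": 64,
}

if __name__ == "__main__":
    print(sparsify_config["sparsify_decoder_layer"])  # 2
    print(train_config["global_batch_size"])          # 64
```

In a PyTorch training script, the two learning rates would typically be realized as separate optimizer parameter groups, one for the LLM parameters and one for the predictor parameters.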