ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
Authors: Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. ... Together, these techniques deliver around 2× faster inference with only about 30% KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Experiments: Comparison on Single-token Answer Datasets. We compare our method with existing methods on mainstream single-token answer datasets, with the results shown in Tab. 1. ... Ablation on Progressive Visual Token Pruning: We conduct ablation experiments on the pruning parameters, stride S and pruning ratio R. |
| Researcher Affiliation | Collaboration | Jiedong Zhuang1,2, Lu Lu2, Ming Dai1, Rui Hu1, Jian Chen2, Qiang Liu2, Haoji Hu1* 1Zhejiang University 2Alibaba Cloud Computing |
| Pseudocode | No | The paper describes its methodology using mathematical formulations and textual explanations, along with a high-level framework illustration in Figure 5, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. Statements like 'We release our code...' or 'Code is available at...' are absent. |
| Open Datasets | Yes | Comparison of various models on the dataset ScienceQA-Img (Lu et al. 2022) ... The best results are in bold. Table 2: Comparison of existing training-free token pruning methods on MLLMs with image caption datasets. ... Coco2017 refers to the validation subset of COCO2017 caption (Chen et al. 2015). Flickr30k and Nocaps are the validation and test splits of the original datasets (Plummer et al. 2015; Agrawal et al. 2019). |
| Dataset Splits | Yes | SQA means the ScienceQA (Lu et al. 2022) image subset, MMMU represents the validation subset of MMMU (Yue et al. 2024), and MMB denotes the English subset of MMBench (Liu et al. 2023). ... Coco2017 refers to the validation subset of COCO2017 caption (Chen et al. 2015). Flickr30k and Nocaps are the validation and test splits of the original datasets (Plummer et al. 2015; Agrawal et al. 2019). |
| Hardware Specification | No | The paper discusses FLOPs and latency as performance metrics but does not specify any particular hardware components such as GPU models, CPU types, or memory sizes used for running the experiments. It lacks concrete details like 'NVIDIA A100' or 'Intel Xeon'. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation of the proposed method or experiments. There is no mention of details like 'Python 3.8' or 'PyTorch 1.9'. |
| Experiment Setup | Yes | Ablation on Progressive Visual Token Pruning: We conduct ablation experiments on the pruning parameters, stride S and pruning ratio R. The proportion of conserved visual tokens in the last layer, C, can be calculated as: C = (1 − ⌊(L − 3)/S⌋ · R) · P (9), where P is the pruning ratio in the 4th layer, set to 50% inspired by FastV (Chen et al. 2024b), and ⌊·⌋ is the round-down operation. Tab. 3 shows the results when C = 1% on LLaVA-1.5-7B and C = 5% on LLaVA-1.5-13B. ... A study of the parameter τ is conducted in the cosine function. |
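
The conserved-token budget in the Experiment Setup row can be sketched numerically. The snippet below is a minimal illustration, assuming Eq. (9) takes the form C = (1 − ⌊(L − 3)/S⌋ · R) · P; the function name and the stride/ratio values are illustrative choices for this example, not settings reported in the paper.

```python
import math

def conserved_ratio(num_layers, stride, ratio, p=0.5):
    """Fraction of visual tokens conserved in the last decoder layer.

    Assumed reconstruction of Eq. (9):
        C = (1 - floor((L - 3) / S) * R) * P
    where P is the layer-4 pruning ratio (50%, following FastV).
    """
    return (1 - math.floor((num_layers - 3) / stride) * ratio) * p

# Illustrative values: a 32-layer model with stride S = 2 and ratio R = 0.07
# yields roughly C = 1%, the budget ablated for LLaVA-1.5-7B in Tab. 3.
print(conserved_ratio(num_layers=32, stride=2, ratio=0.07))
```

Note that C shrinks linearly with the number of pruning steps ⌊(L − 3)/S⌋, so a smaller stride or larger ratio trades accuracy headroom for a tighter token budget.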