ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming
Authors: Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA. We find that numerous visual tokens and partial attention computations are redundant during the decoding process. ... Together, these techniques deliver around 2× faster inference with only about 30% KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. Experiments: Comparison on Single-token Answer Datasets. We compare our method with existing methods on mainstream single-token answer datasets, with the results shown in Tab. 1. ... Ablation on Progressive Visual Token Pruning: We conduct ablation experiments on the pruning parameters, stride S and pruning ratio R. |
| Researcher Affiliation | Collaboration | Jiedong Zhuang1,2, Lu Lu2, Ming Dai1, Rui Hu1, Jian Chen2, Qiang Liu2, Haoji Hu1* 1Zhejiang University 2Alibaba Cloud Computing |
| Pseudocode | No | The paper describes its methodology using mathematical formulations and textual explanations, along with a high-level framework illustration in Figure 5, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. Statements like 'We release our code...' or 'Code is available at...' are absent. |
| Open Datasets | Yes | Comparison of various models on the dataset ScienceQA-Img (Lu et al. 2022) ... The best results are in bold. Table 2: Comparison of existing training-free token pruning methods on MLLMs with image caption datasets. ... Coco2017 refers to the validation subset of COCO2017 caption (Chen et al. 2015). Flickr30k and Nocaps are the validation and test splits of the original datasets (Plummer et al. 2015; Agrawal et al. 2019). |
| Dataset Splits | Yes | SQA means the ScienceQA (Lu et al. 2022) image subset, MMMU represents the validation subset of MMMU (Yue et al. 2024), and MMB denotes the English subset of MMBench (Liu et al. 2023). ... Coco2017 refers to the validation subset of COCO2017 caption (Chen et al. 2015). Flickr30k and Nocaps are the validation and test splits of the original datasets (Plummer et al. 2015; Agrawal et al. 2019). |
| Hardware Specification | No | The paper discusses FLOPs and latency as performance metrics but does not specify any particular hardware components such as GPU models, CPU types, or memory sizes used for running the experiments. It lacks concrete details like 'NVIDIA A100' or 'Intel Xeon'. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation of the proposed method or experiments. There is no mention of details like 'Python 3.8' or 'PyTorch 1.9'. |
| Experiment Setup | Yes | Ablation on Progressive Visual Token Pruning: We conduct ablation experiments on the pruning parameters, stride S and pruning ratio R. The proportion of conserved visual tokens in the last layer, C, can be calculated as: C = (1 − ⌊(L − 3)/S⌋ · R) · P (9), where P is the pruning ratio in the 4th layer, set to 50% inspired by FastV (Chen et al. 2024b), and ⌊·⌋ is the round-down operation. Tab. 3 shows the results when C = 1% on LLaVA-1.5-7B and C = 5% on LLaVA-1.5-13B. ... A study of the parameter τ is conducted in the cosine function. |
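
The conserved-token budget in the Experiment Setup row can be sketched numerically. The snippet below is a minimal illustration, assuming Eq. (9) takes the form C = (1 − ⌊(L − 3)/S⌋ · R) · P; the function name and the stride/ratio values are illustrative choices for this example, not settings reported in the paper.

```python
import math

def conserved_ratio(num_layers, stride, ratio, p=0.5):
    """Fraction of visual tokens conserved in the last decoder layer.

    Assumed reconstruction of Eq. (9):
        C = (1 - floor((L - 3) / S) * R) * P
    where P is the layer-4 pruning ratio (50%, following FastV).
    """
    return (1 - math.floor((num_layers - 3) / stride) * ratio) * p

# Illustrative values: a 32-layer model with stride S = 2 and ratio R = 0.07
# yields roughly C = 1%, the budget ablated for LLaVA-1.5-7B in Tab. 3.
print(conserved_ratio(num_layers=32, stride=2, ratio=0.07))
```

Note that C shrinks linearly with the number of pruning steps ⌊(L − 3)/S⌋, so a smaller stride or larger ratio trades accuracy headroom for a tighter token budget.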