SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A. Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. Our code is available at https://github.com/Gumpest/SparseVLMs. Extensive experiments demonstrate that our SparseVLM effectively reduces computational overhead of various VLMs without sacrificing their performance in a wide range of image and video understanding tasks.
Researcher Affiliation | Collaboration | *Equal contribution. 1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2 Fudan University; 3 EECS, UC Berkeley; 4 Shanghai Jiao Tong University; 5 Panasonic Holdings Corporation. Correspondence to: Wenzhao Zheng <EMAIL>, Shanghang Zhang <EMAIL>.
Pseudocode | No | The paper describes the methodology in Section 3 and Appendix B with equations and descriptive text, but it does not present any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/Gumpest/SparseVLMs.
Open Datasets | Yes | For image-based multimodal evaluation, we conduct experiments on eight widely adopted benchmarks, including GQA (Hudson & Manning, 2019), MMBench (MMB) (Liu et al., 2024c), MME (Fu et al., 2023), POPE (Li et al., 2023b), SQA (Lu et al., 2022), SEED-Bench (SEED) (Li et al., 2024a), TextVQA (Singh et al., 2019), and MM-Vet (Yu et al., 2024). We test on four common video question answering benchmarks: TGIF-QA (Jang et al., 2017), MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017), and ActivityNet-QA (Yu et al., 2019).
Dataset Splits | Yes | We set 3 vision token count configurations (192, 128, and 64) to check the advantages of SparseVLM comprehensively. When pruning from 576 to 192 tokens... To make a fair comparison, we both preserve 194 vision tokens (90.5% pruning ratio) for FastV (Chen et al., 2024a) and SparseVLM. Specifically, following FastV's (Chen et al., 2024a) setup, we use the first 1000 samples per benchmark and score them using the Video-ChatGPT (Maaz et al., 2024) evaluation tool, acknowledging the characteristic length imbalances in these datasets.
Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA A100-80G GPU.
Software Dependencies | Yes | The implementation is carried out in Python 3.10, utilizing PyTorch 2.1.2, CUDA 11.8, and transformers 4.31.0.
Experiment Setup | Yes | We set 3 vision token count configurations (192, 128, and 64) to check the advantages of SparseVLM comprehensively. For LLaVA-1.5-7/13B, Mini-Gemini (MGM), and Qwen-VL, we follow the same inference setting as the original paper as it is publicly available. For video understanding tasks, we adopt the same inference setup as the original Video-LLaVA code base.
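The pruning ratios quoted in the rows above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the paper's code: the 576-token starting budget for the image setting is taken from the quoted text, while the 2048-token budget for Video-LLaVA (8 frames × 256 tokens per frame) is our assumption for reproducing the quoted "90.5% pruning ratio".

```python
# Sanity-check the vision-token pruning ratios quoted in the report.
# Assumption (not stated in the excerpt): Video-LLaVA starts from
# 2048 vision tokens (8 frames x 256 tokens per frame).

def pruning_ratio(total_tokens: int, kept_tokens: int) -> float:
    """Fraction of vision tokens removed when keeping `kept_tokens`."""
    return 1.0 - kept_tokens / total_tokens

# Image setting: LLaVA-1.5 starts from 576 vision tokens,
# pruned to the three quoted configurations.
for kept in (192, 128, 64):
    print(f"576 -> {kept}: {pruning_ratio(576, kept):.1%} pruned")

# Video setting: keeping 194 of an assumed 2048 tokens reproduces
# the quoted 90.5% pruning ratio.
print(f"2048 -> 194: {pruning_ratio(2048, 194):.1%} pruned")
```

Running this shows that the image configurations correspond to roughly 66.7%, 77.8%, and 88.9% pruning, and that 194 of 2048 tokens matches the 90.5% figure to one decimal place.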