SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. Our code is available at https://github.com/Gumpest/SparseVLMs. Extensive experiments demonstrate that our SparseVLM effectively reduces computational overhead of various VLMs without sacrificing their performance in a wide range of image and video understanding tasks. |
| Researcher Affiliation | Collaboration | *Equal contribution 1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University 2Fudan University 3EECS, UC Berkeley 4Shanghai Jiao Tong University 5Panasonic Holdings Corporation. Correspondence to: Wenzhao Zheng <EMAIL>, Shanghang Zhang <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in Section 3 and Appendix B with equations and descriptive text, but it does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Gumpest/SparseVLMs. |
| Open Datasets | Yes | For image-based multimodal evaluation, we conduct experiments on eight widely adopted benchmarks, including GQA (Hudson & Manning, 2019), MMBench (MMB) (Liu et al., 2024c), MME (Fu et al., 2023), POPE (Li et al., 2023b), SQA (Lu et al., 2022), SEED-Bench (SEED) (Li et al., 2024a), VQAText (TextVQA) (Singh et al., 2019), and MMVet (Yu et al., 2024). We test on four common video question answering benchmarks, TGIF-QA (Jang et al., 2017), MSVD-QA (Xu et al., 2017), MSRVTT-QA (Xu et al., 2017), and ActivityNet-QA (Yu et al., 2019). |
| Dataset Splits | Yes | We set 3 vision token count configurations (192, 128, and 64) to check the advantages of SparseVLM comprehensively. When pruning from 576 to 192 tokens... To make a fair comparison, we both preserve 194 vision tokens (90.5% pruning ratio) for FastV (Chen et al., 2024a) and SparseVLM. Specifically, following FastV's (Chen et al., 2024a) setup, we use the first 1000 samples per benchmark and score them using the Video-ChatGPT (Maaz et al., 2024) evaluation tool, acknowledging the characteristic length imbalances in these datasets. |
| Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA A100-80G GPU. |
| Software Dependencies | Yes | The implementation is carried out in Python 3.10, utilizing PyTorch 2.1.2, CUDA 11.8, and transformers 4.31.0. |
| Experiment Setup | Yes | We set 3 vision token count configurations (192, 128, and 64) to check the advantages of SparseVLM comprehensively. For LLaVA-1.5-7/13B, Mini-Gemini (MGM), and Qwen-VL, we follow the same inference setting as the original paper as it is publicly available. For video understanding tasks, we adopt the same inference setup as the original Video-LLaVA code base. |
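The configurations quoted above prune the visual sequence from 576 tokens down to a fixed budget (192, 128, or 64). The paper provides no pseudocode, so the sketch below is only an illustrative assumption of score-based top-k token pruning, not SparseVLM's actual algorithm: the function name `prune_visual_tokens` and the use of a generic per-token relevance score are hypothetical, and SparseVLM's own relevance criterion (derived from text-to-vision attention) is more involved.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, scores, keep=192):
    """Keep the `keep` highest-scoring vision tokens, drop the rest.

    visual_tokens: (N, D) array of vision token embeddings
    scores:        (N,) per-token relevance score (hypothetical stand-in
                   for an attention-derived criterion)
    """
    keep = min(keep, len(scores))
    idx = np.argsort(scores)[-keep:]  # indices of the top-`keep` scores
    idx.sort()                        # preserve original token order
    return visual_tokens[idx]

# Example: prune 576 tokens down to 192, matching the largest budget above.
tokens = np.random.randn(576, 1024)
scores = np.random.rand(576)
pruned = prune_visual_tokens(tokens, scores, keep=192)
print(pruned.shape)  # (192, 1024)
```

Sorting the surviving indices keeps the pruned tokens in their original spatial order, which matters because the LLM's positional encoding still assumes the original sequence layout.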