Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Authors: Penghao Wu, Lewei Lu, Ziwei Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our hypothesis regarding computation-level redundancy in decoder-only LMMs, we first design a series of exploratory experiments to investigate the presence of such redundancy in self-attention operations among vision tokens... As shown in Figure 2, directly masking vision token attention across the entire LLM leads to a significant performance drop, while masking it from the middle or later layers has minimal or no effect on performance. |
| Researcher Affiliation | Collaboration | Penghao Wu (1), Lewei Lu (2), Ziwei Liu (1); (1) S-Lab, Nanyang Technological University; (2) SenseTime Research. |
| Pseudocode | No | The paper describes the proposed algorithm, ProxyV, in detail with figures and textual explanations of its components and operations, but it does not present a formal pseudocode block or algorithm listing. |
| Open Source Code | No | The code will be made public here. |
| Open Datasets | Yes | We select a set of OCR-extensive benchmarks (DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022), InfoVQA (Mathew et al., 2022), OCRBench (Liu et al., 2024c), TextVQA (Singh et al., 2019))... For the document parsing task, we continue to train the models on the 1M document parsing data from the DocStruct4M (Hu et al., 2024) dataset and evaluate them on the CCpdf (Turski et al., 2023) dataset in the validation split. |
| Dataset Splits | Yes | For all evaluations, we use the validation splits of DocVQA (Mathew et al., 2021), InfoVQA (Mathew et al., 2022), and TextVQA (Singh et al., 2019). We use the English dev split for MMBench (Liu et al., 2025) and the perception split for MME (Fu et al., 2023)... For the grounding benchmark RefCOCO, we calculate the average of the testA and testB splits. |
| Hardware Specification | Yes | The reported FLOPs and time for all experiments are measured during the prefilling stage, using a fixed configuration of five image grids (2880 tokens) and 50 text tokens, with eager attention implementation on a single H100 GPU. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies (e.g., PyTorch, TensorFlow) with version numbers. |
| Experiment Setup | Yes | For all experiments, we use the widely adopted 2-stage training pipeline. For stage 1, we pretrain the multi-modal projector and the newly added vision-specific modules using 1.2M captioning data from ShareGPT4V (Chen et al., 2025a) for 1 epoch. For the finetuning stage, we train the model for 1 epoch using the 779K instruction tuning data in LLaVA-NeXT (Liu et al., 2024a) and unfreeze the LLM in this stage. For our ProxyV implementation, we choose the downsampling factor r = 4 so that 576 full vision tokens are compressed to 36 proxy vision tokens, and each proxy token corresponds to 16 full vision tokens in the guided-update process. For the non-spatial ProxyV version, we set the number of learnable queries to be the same as the spatial version. The hidden dimension in the guided-update MLP module is set to be 1/4 of the hidden dimension in the LLM. The number of parameters of the newly added guided-update module for each layer is 14.68M for the Vicuna-1.5-7B case. For the VisionZip baseline, we use 360 dominant tokens and 40 contextual tokens. For the PyramidDrop baseline, vision tokens are reduced by 50% after layers 12, 20, and 26. |
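The token arithmetic in the setup (downsampling factor r = 4 mapping a 24×24 grid of 576 full vision tokens onto a 6×6 grid of 36 proxy tokens, each covering a 4×4 block of 16 full tokens) can be sanity-checked with a minimal sketch. Note this is illustrative only: it uses average pooling to show the spatial correspondence, whereas ProxyV's actual proxy tokens and per-layer guided-update module are learned components described in the paper, not plain pooling.

```python
import numpy as np

def downsample_proxy_tokens(vision_tokens: np.ndarray, r: int = 4) -> np.ndarray:
    """Map a flattened (H*W, d) grid of vision tokens to (H/r * W/r, d) proxies.

    Each proxy summarizes one r x r block of full tokens (here via average
    pooling, purely to illustrate the spatial grouping).
    """
    n, d = vision_tokens.shape
    side = int(round(np.sqrt(n)))          # 24 for 576 tokens
    assert side * side == n and side % r == 0
    grid = vision_tokens.reshape(side, side, d)
    # Group into (side/r) x (side/r) blocks of r x r tokens, pool each block.
    proxies = grid.reshape(side // r, r, side // r, r, d).mean(axis=(1, 3))
    return proxies.reshape(-1, d)          # (36, d) for side = 24, r = 4

# Hypothetical shapes matching the paper's configuration (Vicuna-7B hidden dim 4096).
tokens = np.random.randn(576, 4096).astype(np.float32)
proxies = downsample_proxy_tokens(tokens, r=4)
assert proxies.shape == (36, 4096)         # 576 / 16 = 36: one proxy per 16 full tokens
```

With equal-sized blocks, each proxy is an unweighted mean of its 16 tokens, which makes the 16:1 correspondence in the guided-update process easy to verify before swapping in the learned module.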