Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Authors: Penghao Wu, Lewei Lu, Ziwei Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our hypothesis regarding computation-level redundancy in decoder-only LMMs, we first design a series of exploratory experiments to investigate the presence of such redundancy in self-attention operations among vision tokens... As shown in Figure 2, directly masking vision token attention across the entire LLM leads to a significant performance drop, while masking it from the middle or later layers has minimal or no effect on performance. |
| Researcher Affiliation | Collaboration | Penghao Wu (1), Lewei Lu (2), Ziwei Liu (1); (1) S-Lab, Nanyang Technological University; (2) SenseTime Research. |
| Pseudocode | No | The paper describes the proposed algorithm, ProxyV, in detail with figures and textual explanations of its components and operations, but it does not present a formal pseudocode block or algorithm listing. |
| Open Source Code | No | The code will be made public here. |
| Open Datasets | Yes | We select a set of OCR-extensive benchmarks (DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022), InfoVQA (Mathew et al., 2022), OCRBench (Liu et al., 2024c), TextVQA (Singh et al., 2019))... For the document parsing task, we continue to train the models on the 1M document parsing data from the DocStruct4M (Hu et al., 2024) dataset and evaluate them on the CCpdf (Turski et al., 2023) dataset in the validation split. |
| Dataset Splits | Yes | For all evaluations, we use the validation splits of DocVQA (Mathew et al., 2021), InfoVQA (Mathew et al., 2022), and TextVQA (Singh et al., 2019). We use the English dev split for MMBench (Liu et al., 2025) and the perception split for MME (Fu et al., 2023)... For the grounding benchmark RefCOCO, we calculate the average of the testA and testB splits. |
| Hardware Specification | Yes | The reported FLOPs and time for all experiments are measured during the prefilling stage, using a fixed configuration of five image grids (2880 tokens) and 50 text tokens, with eager attention implementation on a single H100 GPU. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies (e.g., PyTorch, TensorFlow) with version numbers. |
| Experiment Setup | Yes | For all experiments, we use the widely adopted 2-stage training pipeline. For stage 1, we pretrain the multi-modal projector and the newly added vision-specific modules using 1.2M captioning data from ShareGPT4V (Chen et al., 2025a) for 1 epoch. For the finetuning stage, we train the model for 1 epoch using the 779K instruction tuning data in LLaVA-NeXT (Liu et al., 2024a) and unfreeze the LLM in this stage. For our ProxyV implementation, we choose the downsampling factor r = 4 so that 576 full vision tokens are compressed to 36 proxy vision tokens, and each proxy token corresponds to 16 full vision tokens in the guided-update process. For the non-spatial ProxyV version, we set the number of learnable queries to be the same as the spatial version. The hidden dimension in the guided-update MLP module is set to be 1/4 of the hidden dimension in the LLM. The number of parameters of the newly added guided-update module for each layer is 14.68M for the Vicuna-1.5-7B case. For the VisionZip baseline, we use 360 dominant tokens and 40 contextual tokens. For the PyramidDrop baseline, vision tokens are reduced by 50% after layers 12, 20, and 26. |
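The token arithmetic in the setup (downsampling factor r = 4 mapping a 24×24 grid of 576 full vision tokens onto a 6×6 grid of 36 proxy tokens, each covering a 4×4 block of 16 full tokens) can be sanity-checked with a minimal sketch. Note this is illustrative only: it uses average pooling to show the spatial correspondence, whereas ProxyV's actual proxy tokens and per-layer guided-update module are learned components described in the paper, not plain pooling.

```python
import numpy as np

def downsample_proxy_tokens(vision_tokens: np.ndarray, r: int = 4) -> np.ndarray:
    """Map a flattened (H*W, d) grid of vision tokens to (H/r * W/r, d) proxies.

    Each proxy summarizes one r x r block of full tokens (here via average
    pooling, purely to illustrate the spatial grouping).
    """
    n, d = vision_tokens.shape
    side = int(round(np.sqrt(n)))          # 24 for 576 tokens
    assert side * side == n and side % r == 0
    grid = vision_tokens.reshape(side, side, d)
    # Group into (side/r) x (side/r) blocks of r x r tokens, pool each block.
    proxies = grid.reshape(side // r, r, side // r, r, d).mean(axis=(1, 3))
    return proxies.reshape(-1, d)          # (36, d) for side = 24, r = 4

# Hypothetical shapes matching the paper's configuration (Vicuna-7B hidden dim 4096).
tokens = np.random.randn(576, 4096).astype(np.float32)
proxies = downsample_proxy_tokens(tokens, r=4)
assert proxies.shape == (36, 4096)         # 576 / 16 = 36: one proxy per 16 full tokens
```

With equal-sized blocks, each proxy is an unweighted mean of its 16 tokens, which makes the 16:1 correspondence in the guided-update process easy to verify before swapping in the learned module.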