HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models

Authors: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7× increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits. ... 5 Evaluation We evaluate HiRED on LLaVA-Next (Liu et al. 2024a), LLaVA-v1.5 (Liu et al. 2023), and ShareGPT4V (Chen et al. 2025a).
Researcher Affiliation | Academia | 1Virginia Tech, Blacksburg, VA, USA 2Queen's University Belfast, Belfast, UK 3University College Dublin, Dublin, Ireland EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: HiRED 1: Input: N_budget, N_ViT, α, k, l_init, l_final, H, T_pi, {a^pi_{l,h}[j]}.
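The core step of the pseudocode, keeping only the most attended visual tokens within a budget, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the toy attention scores are invented here, and HiRED additionally allocates separate budgets per image partition before selecting tokens.

```python
import numpy as np

def select_tokens(cls_attention: np.ndarray, n_budget: int) -> np.ndarray:
    """Return the indices of the n_budget visual tokens with the highest
    CLS-attention scores, in their original (spatial) order."""
    top = np.argsort(cls_attention)[-n_budget:]  # indices of the top scores
    return np.sort(top)                          # restore spatial ordering

# Toy example: 8 visual tokens, keep a 50% budget (4 tokens).
scores = np.array([0.05, 0.30, 0.02, 0.20, 0.10, 0.15, 0.08, 0.12])
kept = select_tokens(scores, 4)
print(kept)  # indices of the 4 highest-attention tokens
```

Dropping tokens before they reach the language model is what yields the throughput and memory savings reported above, since LLM cost grows with the number of prefill tokens.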
Open Source Code | Yes | Code https://github.com/hasanar1f/HiRED
Open Datasets | Yes | We used eight benchmarks from the LMMS-EVAL (Zhang et al. 2024b) evaluation framework across three different task types: 1) Visual Question Answering (VQA) includes high-level object recognition benchmarks such as VQA-v2 (Goyal et al. 2017) and ScienceQA (Lu et al. 2022); 2) Transcription focuses on fine-grained transcription tasks, including TextVQA (Singh et al. 2019), DocVQA (Mathew, Karatzas, and Jawahar 2021), and OCRBench (Liu et al. 2024b); and 3) Others consists of MME (Fu et al. 2024) for perception and cognition abilities, POPE (Li et al. 2023b) for hallucination detection, and ChartQA (Masry et al. 2022).
Dataset Splits | No | The paper mentions several benchmarks (VQA-v2, ScienceQA, TextVQA, DocVQA, OCRBench, MME, POPE, and ChartQA) but does not explicitly provide details about training/test/validation splits for these datasets within the paper's text.
Hardware Specification | Yes | for single inference on an NVIDIA TESLA P40 (24 GB). For performance evaluation, we use an entry-level NVIDIA TESLA P40 (24 GB) GPU.
Software Dependencies | No | The paper mentions using LLaVA-Next, LLaVA-v1.5, and ShareGPT4V models but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7× increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). ... Therefore, we choose α = 0.5 as the default value for allocating the token budget between the full-image and sub-images. ... We use the CLS-attention from the initial ViT layer (l_init = 0) to allocate the token budget. ... Specifically, we add CLS-attention of the final layer (l_final = 22) across all heads.
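The α = 0.5 budget split quoted above can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's code: the function name is invented, and the even per-crop division simplifies HiRED, which distributes the sub-image share according to each crop's aggregate CLS attention.

```python
def split_budget(n_budget: int, alpha: float, n_sub: int) -> tuple:
    """Split a total visual-token budget: a fraction alpha goes to the
    full-image view; the remainder is divided among the sub-image crops.
    (Even per-crop split is a simplification of HiRED's attention-weighted
    distribution.)"""
    full = round(alpha * n_budget)            # tokens for the full image
    per_sub = (n_budget - full) // n_sub      # tokens per sub-image crop
    return full, per_sub

# LLaVA-Next encodes 1 full image + 4 crops at 576 tokens each (2880 total);
# a 20% budget (HiRED-20%) is therefore 576 tokens.
full, per_sub = split_budget(576, alpha=0.5, n_sub=4)
print(full, per_sub)  # 288 tokens for the full image, 72 per crop
```

With α = 0.5 the full-image view and the four crops each receive half of the total budget, matching the default the excerpt describes.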