HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
Authors: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7× increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits. ... 5 Evaluation We evaluate HiRED on LLaVA-Next (Liu et al. 2024a), LLaVA-v1.5 (Liu et al. 2023), and ShareGPT4V (Chen et al. 2025a). |
| Researcher Affiliation | Academia | 1Virginia Tech, Blacksburg, VA, USA; 2Queen's University Belfast, Belfast, UK; 3University College Dublin, Dublin, Ireland. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: HiRED. 1: Input: N_budget, N_ViT, α, k, l_init, l_final, H, T_{p_i}, {a^{p_i}_{l,h}[j]}. |
| Open Source Code | Yes | Code: https://github.com/hasanar1f/HiRED |
| Open Datasets | Yes | We used eight benchmarks from the LMMS-EVAL (Zhang et al. 2024b) evaluation framework across three different task types: 1) Visual Question Answering (VQA) includes high-level object recognition benchmarks such as VQA-v2 (Goyal et al. 2017) and ScienceQA (Lu et al. 2022); 2) Transcription focuses on fine-grained transcription tasks, including TextVQA (Singh et al. 2019), DocVQA (Mathew, Karatzas, and Jawahar 2021), and OCRBench (Liu et al. 2024b); and 3) Others consists of MME (Fu et al. 2024) for perception and cognition abilities, POPE (Li et al. 2023b) for hallucination detection, and ChartQA (Masry et al. 2022). |
| Dataset Splits | No | The paper mentions several benchmarks like VQA-v2, ScienceQA, TextVQA, DocVQA, OCRBench, MME, POPE, and ChartQA but does not explicitly provide details about training/test/validation splits for these datasets within the paper's text. |
| Hardware Specification | Yes | For performance evaluation, we use an entry-level NVIDIA TESLA P40 (24 GB) GPU. |
| Software Dependencies | No | The paper mentions using LLaVA-Next, LLaVA-v1.5, and ShareGPT4V models but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7× increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). ... Therefore, we choose α = 0.5 as the default value for allocating the token budget between the full-image and sub-images. ... We use the CLS-attention from the initial ViT layer (l_init = 0) to allocate the token budget. ... Specifically, we add CLS-attention of the final layer (l_final = 22) across all heads. |
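The experiment-setup excerpt describes HiRED's two-stage selection: CLS-attention from the initial ViT layer (l_init = 0) splits the token budget between the full image and its sub-images (with α = 0.5 reserved for the full image), and CLS-attention from the final layer (l_final = 22), summed across heads, ranks tokens within each partition. The sketch below illustrates that budgeting-and-ranking logic under stated assumptions; the function name, array shapes, and proportional sub-image split are illustrative, not the authors' exact implementation.

```python
import numpy as np

def hired_token_selection(cls_attn_init, cls_attn_final, n_budget, alpha=0.5):
    """Illustrative HiRED-style token selection (hypothetical helper).

    cls_attn_init  : per-partition CLS-attention vectors from the initial
                     ViT layer; index 0 is the full image, the rest are
                     sub-images. Used only to allocate the budget.
    cls_attn_final : per-partition CLS-attention vectors from the final
                     ViT layer, already summed across heads. Used to rank
                     tokens within each partition.
    n_budget       : total number of visual tokens to keep.
    alpha          : fraction of the budget reserved for the full image.

    Returns a list of kept token indices per partition.
    """
    # Split the budget: alpha for the full image, the rest for sub-images.
    full_budget = int(alpha * n_budget)
    sub_budget_total = n_budget - full_budget

    # Distribute the sub-image budget in proportion to each sub-image's
    # total initial-layer CLS attention (its "importance").
    sub_scores = np.array([a.sum() for a in cls_attn_init[1:]])
    sub_budgets = np.floor(
        sub_budget_total * sub_scores / sub_scores.sum()
    ).astype(int)
    budgets = [full_budget] + list(sub_budgets)

    # Within each partition, keep the tokens with the highest
    # final-layer CLS attention.
    kept = []
    for attn, budget in zip(cls_attn_final, budgets):
        top = np.argsort(attn)[::-1][:budget]
        kept.append(np.sort(top))
    return kept
```

With a 20% budget over 2880 tokens (576 per partition, five partitions, as in a LLaVA-Next-style tiling), this keeps 576 tokens total: 288 from the full image and the remainder spread over sub-images by their initial-layer attention mass.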