HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
Authors: Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7× increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). For larger batch sizes (e.g., 4), HiRED-20% prevents out-of-memory errors by cutting memory usage by 30%, while preserving throughput and latency benefits. ... 5 Evaluation We evaluate HiRED on LLaVA-Next (Liu et al. 2024a), LLaVA-v1.5 (Liu et al. 2023), and ShareGPT4V (Chen et al. 2025a). |
| Researcher Affiliation | Academia | 1Virginia Tech, Blacksburg, VA, USA; 2Queen's University Belfast, Belfast, UK; 3University College Dublin, Dublin, Ireland. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: HiRED. 1: Input: N_budget, N_ViT, α, k, l_init, l_final, H, T_{p_i}, {a^{p_i}_{l,h}[j]}. |
| Open Source Code | Yes | Code: https://github.com/hasanar1f/HiRED |
| Open Datasets | Yes | We used eight benchmarks from the LMMS-EVAL (Zhang et al. 2024b) evaluation framework across three different task types: 1) Visual Question Answering (VQA) includes high-level object recognition benchmarks such as VQA-v2 (Goyal et al. 2017) and ScienceQA (Lu et al. 2022); 2) Transcription focuses on fine-grained transcription tasks, including TextVQA (Singh et al. 2019), DocVQA (Mathew, Karatzas, and Jawahar 2021), and OCRBench (Liu et al. 2024b); and 3) Others consists of MME (Fu et al. 2024) for perception and cognition abilities, POPE (Li et al. 2023b) for hallucination detection, and ChartQA (Masry et al. 2022). |
| Dataset Splits | No | The paper mentions several benchmarks like VQA-v2, ScienceQA, TextVQA, DocVQA, OCRBench, MME, POPE, and ChartQA but does not explicitly provide details about training/test/validation splits for these datasets within the paper's text. |
| Hardware Specification | Yes | For performance evaluation, we use an entry-level NVIDIA TESLA P40 (24 GB) GPU. |
| Software Dependencies | No | The paper mentions using LLaVA-Next, LLaVA-v1.5, and ShareGPT4V models but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Empirically, HiRED-20% (i.e., a 20% token budget) on LLaVA-Next-7B achieves a 4.7× increase in token generation throughput, reduces response latency by 78%, and saves 14% of GPU memory for single inference on an NVIDIA TESLA P40 (24 GB). ... Therefore, we choose α = 0.5 as the default value for allocating the token budget between the full-image and sub-images. ... We use the CLS-attention from the initial ViT layer (l_init = 0) to allocate the token budget. ... Specifically, we add CLS-attention of the final layer (l_final = 22) across all heads. |
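The experiment-setup excerpt describes HiRED's two-stage selection: CLS-attention from the initial ViT layer (l_init = 0) splits the token budget between the full image and its sub-images (with α = 0.5 reserved for the full image), and CLS-attention from the final layer (l_final = 22), summed across heads, ranks tokens within each partition. The sketch below illustrates that budgeting-and-ranking logic under stated assumptions; the function name, array shapes, and proportional sub-image split are illustrative, not the authors' exact implementation.

```python
import numpy as np

def hired_token_selection(cls_attn_init, cls_attn_final, n_budget, alpha=0.5):
    """Illustrative HiRED-style token selection (hypothetical helper).

    cls_attn_init  : per-partition CLS-attention vectors from the initial
                     ViT layer; index 0 is the full image, the rest are
                     sub-images. Used only to allocate the budget.
    cls_attn_final : per-partition CLS-attention vectors from the final
                     ViT layer, already summed across heads. Used to rank
                     tokens within each partition.
    n_budget       : total number of visual tokens to keep.
    alpha          : fraction of the budget reserved for the full image.

    Returns a list of kept token indices per partition.
    """
    # Split the budget: alpha for the full image, the rest for sub-images.
    full_budget = int(alpha * n_budget)
    sub_budget_total = n_budget - full_budget

    # Distribute the sub-image budget in proportion to each sub-image's
    # total initial-layer CLS attention (its "importance").
    sub_scores = np.array([a.sum() for a in cls_attn_init[1:]])
    sub_budgets = np.floor(
        sub_budget_total * sub_scores / sub_scores.sum()
    ).astype(int)
    budgets = [full_budget] + list(sub_budgets)

    # Within each partition, keep the tokens with the highest
    # final-layer CLS attention.
    kept = []
    for attn, budget in zip(cls_attn_final, budgets):
        top = np.argsort(attn)[::-1][:budget]
        kept.append(np.sort(top))
    return kept
```

With a 20% budget over 2880 tokens (576 per partition, five partitions, as in a LLaVA-Next-style tiling), this keeps 576 tokens total: 288 from the full image and the remainder spread over sub-images by their initial-layer attention mass.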