A-VL: Adaptive Attention for Large Vision-Language Models

Authors: Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiang-Yang Li

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.
Researcher Affiliation Collaboration 1 University of Science and Technology of China, Hefei, China; 2 NIO Inc., Shanghai, China
Pseudocode No The paper describes methods and concepts using text and figures (e.g., Figure 6: The design of adaptive vision attention), but does not contain a specific section or block labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code No Our method is plug-and-play, requiring no fine-tuning of the original model. Cache selection and eviction can be processed in parallel with the original model inference, and the results are not utilized until the next step. Besides, in our adaptive vision attention method, core caches are selected from secondary caches for computation, as depicted in Figure 7. Slicing the cache before matrix multiplication introduces a performance bottleneck, because the core caches are not stored contiguously in memory (Liu et al. 2023b); the latency of slicing can sometimes exceed that of the original matrix multiplication itself. To address this, we develop a specialized CUDA operator that allows direct multiplication with selected rows or columns of the second matrix, eliminating the need for slicing.
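The slicing bottleneck described in that quote can be illustrated with a small sketch. This is a pure-NumPy stand-in, not the paper's operator (which is a CUDA kernel); the function names and shapes are hypothetical. The first function materializes a contiguous copy of the selected "core" rows before multiplying, which is the extra cost a fused gather-multiply kernel avoids by reading each selected row directly.

```python
import numpy as np

def sliced_attention(q, k_cache, core_idx):
    """Baseline: fancy indexing gathers the non-contiguous core rows
    into a new contiguous array, then multiplies. The gather copy is
    the overhead the paper's CUDA operator eliminates."""
    k_core = k_cache[core_idx]          # gather -> contiguous copy
    return q @ k_core.T

def fused_gather_matmul(q, k_cache, core_idx):
    """Sketch of what a fused gather+matmul computes: output column j
    reads row core_idx[j] of the cache directly, with no intermediate
    sliced copy. (Emulated here with a Python loop.)"""
    out = np.empty((q.shape[0], len(core_idx)), dtype=q.dtype)
    for j, row in enumerate(core_idx):
        out[:, j] = q @ k_cache[row]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))            # one decode-step query
k_cache = rng.standard_normal((1024, 64))   # full key cache
core_idx = np.array([3, 17, 512, 900])      # hypothetical selected core rows
assert np.allclose(sliced_attention(q, k_cache, core_idx),
                   fused_gather_matmul(q, k_cache, core_idx))
```

Both paths compute the same attention logits; the difference in the real implementation is memory traffic, not arithmetic.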
Open Datasets Yes We randomly select 100 samples from the Flickr30k dataset (Young et al. 2014) that can output longer text sequences. We utilize the attention score to demonstrate the attention allocated to input tokens during the generation of a new token. ... The tasks and datasets used in our evaluation are as follows: Image Caption. This task involves automatically generating textual descriptions for visual content. We employ two datasets, Nocaps (Agrawal et al. 2019) and Flickr30k (Plummer et al. 2015). The metric is the CIDEr score (Vedantam, Lawrence Zitnick, and Parikh 2015). Visual Question Answering (VQA). This task requires models to answer questions based on visual information from images. We use three datasets, DocVQA (Mathew, Karatzas, and Jawahar 2021), TextVQA (Singh et al. 2019) and VQAv2 (Goyal et al. 2017), measuring performance with the ANLS (Biten et al. 2019) and accuracy metrics. Optical Character Recognition (OCR). OCR involves recognizing text content within images. We select the OCRBench dataset (Liu et al. 2023a), which is specifically designed for LVLMs.
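For reference, the ANLS metric mentioned above (Average Normalized Levenshtein Similarity, standard for DocVQA-style benchmarks) can be sketched as follows. This follows the commonly used definition with threshold tau = 0.5; it is an illustration, not code from the paper, and the helper names are our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, answers: list[str], tau: float = 0.5) -> float:
    """Per-sample ANLS: best similarity over ground-truth answers;
    scores with normalized distance >= tau are zeroed out."""
    best = 0.0
    for gt in answers:
        p, g = pred.lower().strip(), gt.lower().strip()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best
```

The dataset-level score averages `anls` over all questions, so minor OCR-style typos are partially credited while unrelated answers score zero.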
Dataset Splits No To investigate attention changes in LVLM, we randomly select 100 samples from the Flickr30k dataset (Young et al. 2014) that can output longer text sequences. We quantify the attention score changes in the LLaVA-1.5 7B model across three token types during the decode phase. ... We employ multiple vision-language tasks to evaluate our method. ... The tasks and datasets used in our evaluation are as follows: Image Caption. ... Visual Question Answering (VQA). ... Optical Character Recognition (OCR).
Hardware Specification Yes We measure the latency during the process in Figure 7 with PyTorch on an NVIDIA A40 GPU under the configuration of LLaVA-1.6 7B. Latency variations across different batch sizes are illustrated in Figure 8.
Software Dependencies No We measure the latency during the process in Figure 7 with PyTorch on an NVIDIA A40 GPU under the configuration of LLaVA-1.6 7B. Latency variations across different batch sizes are illustrated in Figure 8.
Experiment Setup Yes We list all the parameters involved in our method and their description in Table 1. ... For the H2O method, set the cache window to 75% recent and 25% heavy hitter. For FastV, in order to compare performance with the KV cache, we delete redundant tokens according to the FastV method in the prefill phase, and use the generated KV cache for inference in the decode phase.
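The H2O baseline's "75% recent and 25% heavy hitter" cache window can be sketched as follows. This is a minimal illustration assuming heavy hitters are ranked by cumulative attention score, as in the H2O method; the function name and the score array are hypothetical.

```python
import numpy as np

def h2o_keep_indices(cum_attn, budget, recent_frac=0.75):
    """Select which KV-cache positions to keep under a fixed budget:
    the most recent `recent_frac * budget` tokens, plus the older
    tokens with the highest cumulative attention ('heavy hitters')
    filling the remaining slots."""
    n = len(cum_attn)
    if n <= budget:
        return np.arange(n)                     # cache fits; keep all
    n_recent = int(budget * recent_frac)
    n_heavy = budget - n_recent
    recent = np.arange(n - n_recent, n)         # recency window
    older = np.arange(n - n_recent)             # eviction candidates
    heavy = older[np.argsort(cum_attn[older])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

# Example: 8 cached tokens, budget of 4 -> keep 3 recent + 1 heavy hitter.
scores = np.array([5.0, 1.0, 0.0, 9.0, 2.0, 0.0, 0.0, 0.0])
kept = h2o_keep_indices(scores, budget=4)
```

With these scores the heavy-hitter slot goes to position 3 (score 9.0) and the recency window keeps positions 5-7, so `kept` is `[3, 5, 6, 7]`.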