A-VL: Adaptive Attention for Large Vision-Language Models

Authors: Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiang-Yang Li

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.
Researcher Affiliation Collaboration 1 University of Science and Technology of China, Hefei, China; 2 NIO Inc., Shanghai, China
Pseudocode No The paper describes methods and concepts using text and figures (e.g., Figure 6: The design of adaptive vision attention), but does not contain a specific section or block labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code No Our method is plug-and-play, requiring no fine-tuning of the original model. Cache selection and eviction can be processed in parallel with the original model inference, and the results are not utilized until the next step. Besides, in our adaptive vision attention method, core caches are selected from secondary caches for computation, as depicted in Figure 7. Slicing the cache before matrix multiplication introduces a performance bottleneck, because the core caches are not stored contiguously in memory (Liu et al. 2023b); the latency of slicing can sometimes exceed that of the original matrix multiplication itself. To address this, we develop a specialized CUDA operator that allows direct multiplication with selected rows or columns of the second matrix, eliminating the need for slicing.
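The slicing bottleneck described in that quote can be illustrated with a small sketch. This is a pure-NumPy stand-in, not the paper's operator (which is a CUDA kernel); the function names and shapes are hypothetical. The first function materializes a contiguous copy of the selected "core" rows before multiplying, which is the extra cost a fused gather-multiply kernel avoids by reading each selected row directly.

```python
import numpy as np

def sliced_attention(q, k_cache, core_idx):
    """Baseline: fancy indexing gathers the non-contiguous core rows
    into a new contiguous array, then multiplies. The gather copy is
    the overhead the paper's CUDA operator eliminates."""
    k_core = k_cache[core_idx]          # gather -> contiguous copy
    return q @ k_core.T

def fused_gather_matmul(q, k_cache, core_idx):
    """Sketch of what a fused gather+matmul computes: output column j
    reads row core_idx[j] of the cache directly, with no intermediate
    sliced copy. (Emulated here with a Python loop.)"""
    out = np.empty((q.shape[0], len(core_idx)), dtype=q.dtype)
    for j, row in enumerate(core_idx):
        out[:, j] = q @ k_cache[row]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))            # one decode-step query
k_cache = rng.standard_normal((1024, 64))   # full key cache
core_idx = np.array([3, 17, 512, 900])      # hypothetical selected core rows
assert np.allclose(sliced_attention(q, k_cache, core_idx),
                   fused_gather_matmul(q, k_cache, core_idx))
```

Both paths compute the same attention logits; the difference in the real implementation is memory traffic, not arithmetic.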
Open Datasets Yes We randomly select 100 samples from the Flickr30k dataset (Young et al. 2014) that can output longer text sequences. We utilize the attention score to demonstrate the attention allocated to input tokens during the generation of a new token. ... The tasks and datasets used in our evaluation are as follows: Image Caption. This task involves automatically generating textual descriptions for visual content. We employ two datasets, Nocaps (Agrawal et al. 2019) and Flickr30k (Plummer et al. 2015). The metric is the CIDEr score (Vedantam, Lawrence Zitnick, and Parikh 2015). Visual Question Answering (VQA). This task requires models to answer questions based on visual information from images. We use three datasets, DocVQA (Mathew, Karatzas, and Jawahar 2021), TextVQA (Singh et al. 2019) and VQAv2 (Goyal et al. 2017), measuring performance with the ANLS (Biten et al. 2019) and accuracy metrics. Optical Character Recognition (OCR). OCR involves recognizing text content within images. We select the OCRBench dataset (Liu et al. 2023a), which is specifically designed for LVLMs.
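For reference, the ANLS metric mentioned above (Average Normalized Levenshtein Similarity, standard for DocVQA-style benchmarks) can be sketched as follows. This follows the commonly used definition with threshold tau = 0.5; it is an illustration, not code from the paper, and the helper names are our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, answers: list[str], tau: float = 0.5) -> float:
    """Per-sample ANLS: best similarity over ground-truth answers;
    scores with normalized distance >= tau are zeroed out."""
    best = 0.0
    for gt in answers:
        p, g = pred.lower().strip(), gt.lower().strip()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best
```

The dataset-level score averages `anls` over all questions, so minor OCR-style typos are partially credited while unrelated answers score zero.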
Dataset Splits No To investigate attention changes in LVLM, we randomly select 100 samples from the Flickr30k dataset (Young et al. 2014) that can output longer text sequences. We quantify the attention score changes in the LLaVA-1.5 7B model across three token types during the decode phase. ... We employ multiple vision-language tasks to evaluate our method. ... The tasks and datasets used in our evaluation are as follows: Image Caption. ... Visual Question Answering (VQA). ... Optical Character Recognition (OCR).
Hardware Specification Yes We measure the latency during the process in Figure 7 with PyTorch on an NVIDIA A40 GPU under the configuration of LLaVA-1.6 7B. Latency variations across different batch sizes are illustrated in Figure 8.
Software Dependencies No We measure the latency during the process in Figure 7 with PyTorch on an NVIDIA A40 GPU under the configuration of LLaVA-1.6 7B. Latency variations across different batch sizes are illustrated in Figure 8.
Experiment Setup Yes We list all the parameters involved in our method and their description in Table 1. ... For the H2O method, set the cache window to 75% recent and 25% heavy hitter. For FastV, in order to compare performance with the KV cache, we delete redundant tokens according to the FastV method in the prefill phase, and use the generated KV cache for inference in the decode phase.
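The H2O baseline's "75% recent and 25% heavy hitter" cache window can be sketched as follows. This is a minimal illustration assuming heavy hitters are ranked by cumulative attention score, as in the H2O method; the function name and the score array are hypothetical.

```python
import numpy as np

def h2o_keep_indices(cum_attn, budget, recent_frac=0.75):
    """Select which KV-cache positions to keep under a fixed budget:
    the most recent `recent_frac * budget` tokens, plus the older
    tokens with the highest cumulative attention ('heavy hitters')
    filling the remaining slots."""
    n = len(cum_attn)
    if n <= budget:
        return np.arange(n)                     # cache fits; keep all
    n_recent = int(budget * recent_frac)
    n_heavy = budget - n_recent
    recent = np.arange(n - n_recent, n)         # recency window
    older = np.arange(n - n_recent)             # eviction candidates
    heavy = older[np.argsort(cum_attn[older])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

# Example: 8 cached tokens, budget of 4 -> keep 3 recent + 1 heavy hitter.
scores = np.array([5.0, 1.0, 0.0, 9.0, 2.0, 0.0, 0.0, 0.0])
kept = h2o_keep_indices(scores, budget=4)
```

With these scores the heavy-hitter slot goes to position 3 (score 9.0) and the recency window keeps positions 5-7, so `kept` is `[3, 5, 6, 7]`.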