A-VL: Adaptive Attention for Large Vision-Language Models
Authors: Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiang-Yang Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance. |
| Researcher Affiliation | Collaboration | 1University of Science and Technology of China, Hefei, China 2NIO Inc., Shanghai, China |
| Pseudocode | No | The paper describes methods and concepts using text and figures (e.g., Figure 6: The design of adaptive vision attention), but does not contain a specific section or block labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Our method is plug-and-play, requiring no fine-tuning of the original model. Cache selection and eviction can be processed in parallel with the original model inference, and the results are not utilized until the next step. Besides, in our adaptive vision attention method, core caches are selected from secondary caches for computation, as depicted in Figure 7. Slicing the cache before matrix multiplication introduces a performance bottleneck, because the core caches are not stored contiguously in memory (Liu et al. 2023b); the latency of slicing can even exceed that of the original matrix multiplication. To address this, we develop a specialized CUDA operator that allows direct multiplication with selected rows or columns of the second matrix, eliminating the need for slicing. |
| Open Datasets | Yes | We randomly select 100 samples from the Flickr30k dataset (Young et al. 2014) that can output longer text sequences. We utilize the attention score to demonstrate the attention allocated to input tokens during the generation of a new token. ... The tasks and datasets used in our evaluation are as follows: Image Caption. This task involves automatically generating textual descriptions for visual content. We employ two datasets, Nocaps (Agrawal et al. 2019) and Flickr30k (Plummer et al. 2015). The metric is the CIDEr score (Vedantam, Lawrence Zitnick, and Parikh 2015). Visual Question Answering (VQA). This task requires models to answer questions based on visual information from images. We use three datasets, DocVQA (Mathew, Karatzas, and Jawahar 2021), TextVQA (Singh et al. 2019) and VQAv2 (Goyal et al. 2017), measuring performance with the ANLS (Biten et al. 2019) and accuracy metrics. Optical Character Recognition (OCR). OCR involves recognizing text content within images. We select the OCRBench dataset (Liu et al. 2023a), which is specifically designed for LVLMs. |
| Dataset Splits | No | To investigate attention changes in LVLM, we randomly select 100 samples from the Flickr30k dataset (Young et al. 2014) that can output longer text sequences. We quantify the attention score changes in the LLaVA-1.5 7B model across three token types during the decode phase. ... We employ multiple vision-language tasks to evaluate our method. ... The tasks and datasets used in our evaluation are as follows: Image Caption. ... Visual Question Answering (VQA). ... Optical Character Recognition (OCR). |
| Hardware Specification | Yes | We measure the latency during the process in Figure 7 with PyTorch on an NVIDIA A40 GPU under the configuration of LLaVA-1.6 7B. Latency variations across different batch sizes are illustrated in Figure 8. |
| Software Dependencies | No | We measure the latency during the process in Figure 7 with PyTorch on an NVIDIA A40 GPU under the configuration of LLaVA-1.6 7B. Latency variations across different batch sizes are illustrated in Figure 8. |
| Experiment Setup | Yes | We list all the parameters involved in our method and their description in Table 1. ... For the H2O method, we set the cache window to 75% recent and 25% heavy hitter. For FastV, in order to compare performance with the KV cache, we delete redundant tokens according to the FastV method in the prefill phase, and use the generated KV cache for inference in the decode phase. |
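The "Open Source Code" evidence describes why the paper needed a specialized CUDA operator: gathering non-contiguous "core" cache rows into a contiguous slice before the matrix multiplication costs an extra copy, so the operator instead reads the selected rows directly during the multiply. Below is a minimal plain-Python sketch of that idea (all names are hypothetical; the paper's actual operator is a CUDA kernel, not shown here), contrasting the slice-then-multiply baseline with a fused indexed variant and checking that both produce the same scores.

```python
# Sketch of the core-cache selection idea behind A-VL's fused operator.
# All function names are illustrative, not from the paper's code.

def scores_with_slicing(query, kv_cache, core_idx):
    """Baseline: materialize a contiguous slice of the selected rows,
    then compute the query-key dot products. The gather is the extra
    cost the paper's CUDA operator avoids."""
    sliced = [kv_cache[i] for i in core_idx]  # explicit copy (gather)
    return [sum(q * k for q, k in zip(query, row)) for row in sliced]

def scores_indexed(query, kv_cache, core_idx):
    """Fused variant: read the selected rows in place during the
    multiply, with no intermediate sliced copy."""
    return [sum(q * k for q, k in zip(query, kv_cache[i])) for i in core_idx]

if __name__ == "__main__":
    query = [1.0, 2.0, 3.0]
    kv_cache = [[0.5, 0.1, 0.2],
                [1.0, 0.0, 1.0],
                [0.3, 0.3, 0.3],
                [2.0, 1.0, 0.0]]
    core_idx = [0, 2, 3]  # non-contiguous selected core-cache rows
    assert scores_with_slicing(query, kv_cache, core_idx) == \
           scores_indexed(query, kv_cache, core_idx)
```

In Python the two variants are equivalent and the "fused" one saves only a list copy; on a GPU, skipping the gather of non-contiguous rows is what removes the latency bottleneck the paper reports around Figure 7.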