LevAttention: Time, Space and Streaming Efficient Algorithm for Heavy Attentions
Authors: Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David Woodruff
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show the benefits of our scheme for vision transformers, showing how to train new models that use our universal set while training as well, showing that our model is able to consistently select important keys during training. We perform experiments on pretrained ViT models to empirically understand the structure of the attention matrices that arise for typical inputs to a model. We then evaluate the effectiveness of leverage score selection for the downstream task of image classification using the pretrained softmax model. We also trained multiple ViT models from scratch using the leverage score based attention mechanism and observe that the model quality improves significantly compared to doing inference on the softmax pretrained models using the leverage score mechanism. Across all the models, we observe that the model quality achieves >90% accuracy of the full softmax attention while selecting the top 32 keys (out of 197 keys for L/16 and S/16 models and out of 785 keys for the L/8 model) using the leverage score mechanism at each attention head. |
| Researcher Affiliation | Collaboration | Ravindran Kannan, Simons Institute, UC Berkeley (EMAIL); Chiranjib Bhattacharyya, Indian Institute of Science (EMAIL); Praneeth Kacham, Google Research (EMAIL); David P. Woodruff, Carnegie Mellon University (EMAIL) |
| Pseudocode | No | The paper describes algorithms in prose (e.g., "The simplest algorithm is a two-pass algorithm...") but does not include any explicitly labeled pseudocode blocks or algorithms with structured, code-like formatting. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing their source code, nor does it provide any links to a code repository. The text states, "We leave the important question of obtaining better models trained using Lev Attention as a future research direction," which suggests that code for training improved models based on their approach is not yet available. |
| Open Datasets | Yes | We train the models on the Imagenet-1k (Russakovsky et al., 2015) dataset using the same hyperparameters as in the original work (Dosovitskiy et al., 2021). We find that the S/16, L/16 and L/8 models achieve accuracies of 76.47%, 78.83% and 79.47% respectively on the validation split of the Imagenet-1k dataset. |
| Dataset Splits | Yes | We train the models on the Imagenet-1k (Russakovsky et al., 2015) dataset using the same hyperparameters as in the original work (Dosovitskiy et al., 2021). We find that the S/16, L/16 and L/8 models achieve accuracies of 76.47%, 78.83% and 79.47% respectively on the validation split of the Imagenet-1k dataset. |
| Hardware Specification | No | The paper describes the ViT models used (S/16, L/16, L/8) and their parameter counts and patch sizes. However, it does not specify any particular hardware like GPU models, CPU types, or other computing resources used to run the experiments. |
| Software Dependencies | No | The paper mentions using "the same training setup as the softmax attention models, i.e., the same learning rate schedule, batch sizes, and optimizer," but it does not specify any software names with version numbers (e.g., PyTorch, TensorFlow, Python, CUDA). |
| Experiment Setup | Yes | We train the models on the Imagenet-1k (Russakovsky et al., 2015) dataset using the same hyperparameters as in the original work (Dosovitskiy et al., 2021). Using the same training setup as the softmax attention models, i.e., the same learning rate schedule, batch sizes, and optimizer, we see significant improvements in the validation accuracies. For the L/8 and L/16, we train for the initial 15% of the steps with full attention to obtain the warm start parameters and then train the remaining 85% of the steps using the leverage score selection based attention. |
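The mechanism referenced throughout the table — restricting softmax attention to the top-k keys ranked by statistical leverage score — can be sketched briefly. The snippet below is an illustrative reconstruction, not the authors' released code; the function names are hypothetical, and it uses the standard definition of row leverage scores (the squared row norms of an orthonormal basis for the key matrix's column space) with k=32, matching the "top 32 keys" setting quoted above.

```python
import numpy as np

def leverage_scores(K):
    # Row leverage scores of the key matrix K (n x d): the diagonal of
    # K (K^T K)^+ K^T, computed stably via a thin QR factorization.
    Q, _ = np.linalg.qr(K)        # Q is n x d with orthonormal columns
    return np.sum(Q**2, axis=1)   # score_i = ||Q_i||^2; scores sum to rank(K)

def leverage_attention(Q_mat, K, V, k=32):
    # Illustrative sketch: softmax attention restricted to the k keys
    # with the largest leverage scores (hypothetical helper, not the
    # paper's implementation).
    idx = np.argsort(leverage_scores(K))[-k:]      # top-k key indices
    logits = Q_mat @ K[idx].T / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise softmax
    return w @ V[idx]
```

For a ViT-style head with 197 keys, this selects 32 of them per head before the softmax, so the attention computation scales with k rather than the full sequence length.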