AdaSplash: Adaptive Sparse Flash Attention
Authors: Nuno Gonçalves, Marcos V. Treviso, André Martins
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing α-entmax implementations. |
| Researcher Affiliation | Collaboration | ¹Instituto Superior Técnico, Universidade de Lisboa, Portugal; ²Instituto de Telecomunicações, Lisbon, Portugal; ³Unbabel, Lisbon, Portugal. Correspondence to: Nuno Gonçalves <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Halley-bisection algorithm for α-entmax. Algorithm 2: ADASPLASH forward pass (w/o masking). Algorithm 3: Halley-bisection for computing τ (block version). Algorithm 4: ADASPLASH backward pass for dK and dV. Algorithm 5: ADASPLASH backward pass for dQ. |
| Open Source Code | Yes | Code: https://github.com/deep-spin/adasplash |
| Open Datasets | Yes | We conducted continuous pretraining of RoBERTa-base and ModernBERT-base on 2B tokens of the English subset of FineWeb-Edu (Lozhkov et al., 2024)... We evaluate our pretrained models on single-vector retrieval performance using the BEIR benchmark (SciFact, NFCorpus, FiQA2018, TREC-COVID)... We fine-tuned a pretrained RoBERTa model (Liu et al., 2019) on the ECtHR (Chalkidis et al., 2019; 2021) dataset... We also evaluate RoBERTa and ModernBERT models with α-entmax attention on the GLUE benchmark (Wang et al., 2018)... we trained a small 124M GPT-2 model (Radford et al., 2019) from scratch on 10B tokens of the FineWeb dataset (Penedo et al., 2024) |
| Dataset Splits | No | The paper describes the datasets used (FineWeb-Edu, BEIR, ECtHR, GLUE, FineWeb) and mentions context lengths (e.g., 128 tokens for GLUE, 1024 tokens for GPT-2) and training/validation use for some, but does not provide explicit split percentages, sample counts, or a specific methodology for dividing these datasets into training, validation, and test sets. It often refers to 'following the setup' of other papers or to 'default hyperparameters', which implies standard splits, but these are not explicitly detailed within the paper itself. |
| Hardware Specification | Yes | Experiments on masked language modeling, text classification, GLUE tasks, and BEIR tasks were carried out on Nvidia RTX A6000 GPUs with 48GB VRAM. Experiments with GPT-2 and the efficiency benchmark in Figures 1 and 3 were carried out on a single Nvidia H100 GPU (80GB). The runtime experiments with ModernBERT were carried out on a single A6000 GPU. |
| Software Dependencies | No | We used the Hugging Face Transformers library for model training and implementation and the Datasets library for data handling. The models were evaluated on BEIR tasks using the MTEB benchmark toolkit. We trained both the standard GPT-2 model and sparse GPT-2 (α = 1.5) using the configuration provided in the llm.c repository. The paper mentions these software components but does not provide specific version numbers for any of them (e.g., PyTorch version, Transformers version). |
| Experiment Setup | Yes | Concretely, we used a batch size of 32 and a learning rate of 5e-5, optimized with the AdamW optimizer. Training was conducted for 100,000 steps using mixed-precision (fp16). The sparsity parameter (α) was initialized at 1.01 and annealed linearly to a final value of 1.5 or 2.0 over 50,000 steps... We use an effective batch size of 512, and use gradient accumulation to fit into available GPU memory. We use the AdamW optimizer, with learning rate 6e-4 and weight decay of 0.1. The learning rate followed a warm-up phase, linearly ramping from zero to a maximum of 6e-4 over the first 700 iterations, equivalent to 350 million tokens. Subsequently, the learning rate decayed to zero across the remaining training steps. |
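The pseudocode row above lists a Halley-bisection solver for α-entmax. As a plain-NumPy reference illustration (not the paper's fused Triton kernel), α-entmax can be computed with simple bisection on the threshold τ; the Halley steps in the paper accelerate convergence of this same root-finding problem. The function name and iteration count below are my own choices:

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax probabilities via bisection on the threshold tau.

    Solves sum_i [(alpha-1)*z_i - tau]_+^(1/(alpha-1)) = 1, then returns
    p_i = [(alpha-1)*z_i - tau]_+^(1/(alpha-1)). Reference sketch only;
    AdaSplash fuses a Halley-bisection variant into a flash-attention kernel.
    """
    z = np.asarray(z, dtype=np.float64) * (alpha - 1.0)
    # tau is bracketed by [max(z') - 1, max(z')]: the sum is >= 1 at the
    # lower end and 0 at the upper end, and is decreasing in tau.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() > 1.0:
            lo = tau   # mass too large -> raise the threshold
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
```

For α = 2 this reduces to sparsemax, and low-scoring entries receive exactly zero probability, which is what makes block-sparse skipping possible in the fused kernel.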
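The schedules in the Experiment Setup row (linear α annealing; learning-rate warm-up followed by decay to zero) can be sketched as below. The decay shape and the GPT-2 total step count are not stated in the paper excerpt, so linear decay and the `total_steps` default are assumptions:

```python
def alpha_schedule(step, alpha_init=1.01, alpha_final=1.5, anneal_steps=50_000):
    """Linearly anneal the entmax sparsity parameter alpha, per the setup row."""
    frac = min(step / anneal_steps, 1.0)
    return alpha_init + frac * (alpha_final - alpha_init)

def lr_schedule(step, max_lr=6e-4, warmup_steps=700, total_steps=20_000):
    """Linear warm-up from zero to max_lr, then decay back to zero.

    total_steps is an assumed placeholder, and the decay is taken to be
    linear; the paper only says the rate 'decayed to zero'.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return max_lr * max(remaining, 0.0)
```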