AdaSplash: Adaptive Sparse Flash Attention

Authors: Nuno Gonçalves, Marcos V. Treviso, André Martins

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments with RoBERTa and Modern BERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing α-entmax implementations.
Researcher Affiliation Collaboration Instituto Superior Técnico, Universidade de Lisboa, Portugal; Instituto de Telecomunicações, Lisbon, Portugal; Unbabel, Lisbon, Portugal. Correspondence to: Nuno Gonçalves <EMAIL>.
Pseudocode Yes Algorithm 1: Halley-bisection algorithm for α-entmax. Algorithm 2: ADASPLASH forward pass (w/o masking). Algorithm 3: Halley-bisection for computing τ (block version). Algorithm 4: ADASPLASH backward pass for dK and dV. Algorithm 5: ADASPLASH backward pass for dQ.
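To make the pseudocode concrete, here is a minimal NumPy sketch of the safeguarded Halley-bisection idea behind Algorithm 1: α-entmax reduces to finding the threshold τ such that the resulting probabilities sum to one, and Halley's method is combined with bisection so each step stays inside a guaranteed root bracket. Function and variable names are ours, not the authors'; this is an unbatched reference version, not the paper's Triton kernel.

```python
import numpy as np

def entmax_threshold(z, alpha=1.5, n_iter=30):
    """Find tau with sum_i max((alpha-1)*z_i - tau, 0)**(1/(alpha-1)) = 1
    via Halley's method, safeguarded by bisection (sketch of Algorithm 1)."""
    z = np.asarray(z, dtype=np.float64)
    am1 = alpha - 1.0
    inv = 1.0 / am1
    zs = am1 * z
    # Bracket: f(zs.max() - 1) >= 0 and f(zs.max()) = -1 < 0.
    lo, hi = zs.max() - 1.0, zs.max()
    tau = 0.5 * (lo + hi)
    for _ in range(n_iter):
        u = np.clip(zs - tau, 0.0, None)
        supp = u > 0
        f = (u[supp] ** inv).sum() - 1.0
        # Keep the bracket tight around the root.
        if f > 0:
            lo = tau
        else:
            hi = tau
        if abs(f) < 1e-12:
            break
        f1 = -inv * (u[supp] ** (inv - 1.0)).sum()
        f2 = inv * (inv - 1.0) * (u[supp] ** (inv - 2.0)).sum()
        denom = 2.0 * f1 * f1 - f * f2
        step = 2.0 * f * f1 / denom if denom != 0 else 0.0
        cand = tau - step
        # Fall back to a bisection step if Halley leaves the bracket.
        tau = cand if lo < cand < hi else 0.5 * (lo + hi)
    return tau

def entmax(z, alpha=1.5):
    """Map scores z to a (possibly sparse) probability vector."""
    z = np.asarray(z, dtype=np.float64)
    tau = entmax_threshold(z, alpha)
    return np.clip((alpha - 1.0) * z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
```

For α = 2 the second derivative vanishes and the update reduces to a safeguarded Newton step, recovering sparsemax.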
Open Source Code Yes Code: https://github.com/deep-spin/adasplash
Open Datasets Yes We conducted continuous pretraining of RoBERTa-base and ModernBERT-base on 2B tokens of the English subset of Fineweb-edu (Lozhkov et al., 2024)... We evaluate our pretrained models on single-vector retrieval performance using the BEIR benchmark (SciFact, NFCorpus, FiQA2018, TREC-COVID)... We fine-tuned a pretrained RoBERTa model (Liu et al., 2019) on the ECtHR (Chalkidis et al., 2019; 2021) dataset... We also evaluate RoBERTa and ModernBERT models with α-entmax attention on the GLUE benchmark (Wang et al., 2018)... we trained a small 124M GPT-2 model (Radford et al., 2019) from scratch on 10B tokens of the FineWeb dataset (Penedo et al., 2024)
Dataset Splits No The paper describes the datasets used (Fineweb-edu, BEIR, ECtHR, GLUE, FineWeb) and mentions context lengths (e.g., 128 tokens for GLUE, 1024 tokens for GPT-2) and training/validation usage for some, but it does not provide explicit split percentages, sample counts, or a specific methodology for dividing these datasets into training, validation, and test sets. It often refers to 'following the setup' of other papers or using 'default hyperparameters', which implies standard splits, but these are not explicitly detailed within the paper itself.
Hardware Specification Yes Experiments on masked language modeling, text classification, GLUE tasks, and BEIR tasks were carried out on Nvidia RTX A6000 GPUs with 48GB VRAM. Experiments with GPT-2 and the efficiency benchmark in Figures 1 and 3 were carried out on a single Nvidia H100 GPU (80GB). The runtime experiments with ModernBERT were carried out on a single A6000 GPU.
Software Dependencies No We used the Hugging Face Transformers library for model training and implementation and the Datasets library for data handling. The models were evaluated on BEIR tasks using the MTEB benchmark toolkit. We trained both the standard GPT-2 model and sparse GPT-2 (α = 1.5) using the configuration provided in the llm.c repository. The paper mentions these software components but does not provide specific version numbers for any of them (e.g., PyTorch version, Transformers version).
Experiment Setup Yes Concretely, we used a batch size of 32 and a learning rate of 5e-5, optimized with the AdamW optimizer. Training was conducted for 100,000 steps using mixed-precision (fp16). The sparsity parameter (α) was initialized at 1.01 and annealed linearly to a final value of 1.5 or 2.0 over 50,000 steps... We use an effective batch size of 512, with gradient accumulation to fit into available GPU memory. We use the AdamW optimizer, with learning rate 6e-4 and weight decay of 0.1. The learning rate followed a warm-up phase, linearly ramping from zero to a maximum of 6e-4 over the first 700 iterations, equivalent to 350 million tokens. Subsequently, the learning rate decayed to zero across the remaining training steps.
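The two schedules described above (learning-rate warmup-then-decay for GPT-2, and linear α annealing for the encoder runs) can be sketched as plain Python functions. Names and the linear shape of the post-warmup decay are our assumptions; the report states only that the learning rate "decayed to zero across the remaining training steps."

```python
def lr_at(step, total_steps, max_lr=6e-4, warmup=700):
    """GPT-2 schedule: linear warmup from zero over the first 700 iterations,
    then decay to zero over the remaining steps (decay shape assumed linear)."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * max(0.0, (total_steps - step) / (total_steps - warmup))

def alpha_at(step, alpha_final=1.5, anneal_steps=50_000, alpha_init=1.01):
    """Sparsity annealing: alpha starts at 1.01 and ramps linearly to its
    final value (1.5 or 2.0) over 50,000 steps, then stays fixed."""
    t = min(step / anneal_steps, 1.0)
    return alpha_init + t * (alpha_final - alpha_init)
```

Starting α near 1 makes the attention close to dense softmax early in training and only gradually introduces sparsity, which matches the annealing described in the setup.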