AdaSplash: Adaptive Sparse Flash Attention

Authors: Nuno Gonçalves, Marcos V. Treviso, André Martins

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments with RoBERTa and Modern BERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing α-entmax implementations.
Researcher Affiliation Collaboration Instituto Superior Técnico, Universidade de Lisboa, Portugal; Instituto de Telecomunicações, Lisbon, Portugal; Unbabel, Lisbon, Portugal. Correspondence to: Nuno Gonçalves <EMAIL>.
Pseudocode Yes Algorithm 1: Halley-bisection algorithm for α-entmax. Algorithm 2: ADASPLASH forward pass (w/o masking). Algorithm 3: Halley-bisection for computing τ (block version). Algorithm 4: ADASPLASH backward pass for dK and dV. Algorithm 5: ADASPLASH backward pass for dQ.
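To make the pseudocode concrete, here is a minimal NumPy sketch of the safeguarded Halley-bisection idea behind Algorithm 1: α-entmax reduces to finding the threshold τ such that the resulting probabilities sum to one, and Halley's method is combined with bisection so each step stays inside a guaranteed root bracket. Function and variable names are ours, not the authors'; this is an unbatched reference version, not the paper's Triton kernel.

```python
import numpy as np

def entmax_threshold(z, alpha=1.5, n_iter=30):
    """Find tau with sum_i max((alpha-1)*z_i - tau, 0)**(1/(alpha-1)) = 1
    via Halley's method, safeguarded by bisection (sketch of Algorithm 1)."""
    z = np.asarray(z, dtype=np.float64)
    am1 = alpha - 1.0
    inv = 1.0 / am1
    zs = am1 * z
    # Bracket: f(zs.max() - 1) >= 0 and f(zs.max()) = -1 < 0.
    lo, hi = zs.max() - 1.0, zs.max()
    tau = 0.5 * (lo + hi)
    for _ in range(n_iter):
        u = np.clip(zs - tau, 0.0, None)
        supp = u > 0
        f = (u[supp] ** inv).sum() - 1.0
        # Keep the bracket tight around the root.
        if f > 0:
            lo = tau
        else:
            hi = tau
        if abs(f) < 1e-12:
            break
        f1 = -inv * (u[supp] ** (inv - 1.0)).sum()
        f2 = inv * (inv - 1.0) * (u[supp] ** (inv - 2.0)).sum()
        denom = 2.0 * f1 * f1 - f * f2
        step = 2.0 * f * f1 / denom if denom != 0 else 0.0
        cand = tau - step
        # Fall back to a bisection step if Halley leaves the bracket.
        tau = cand if lo < cand < hi else 0.5 * (lo + hi)
    return tau

def entmax(z, alpha=1.5):
    """Map scores z to a (possibly sparse) probability vector."""
    z = np.asarray(z, dtype=np.float64)
    tau = entmax_threshold(z, alpha)
    return np.clip((alpha - 1.0) * z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
```

For α = 2 the second derivative vanishes and the update reduces to a safeguarded Newton step, recovering sparsemax.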
Open Source Code Yes Code: https://github.com/deep-spin/adasplash
Open Datasets Yes We conducted continuous pretraining of RoBERTa-base and ModernBERT-base on 2B tokens of the English subset of Fineweb-edu (Lozhkov et al., 2024)... We evaluate our pretrained models on single-vector retrieval performance using the BEIR benchmark (SciFact, NFCorpus, FiQA2018, TREC-COVID)... We fine-tuned a pretrained RoBERTa model (Liu et al., 2019) on the ECtHR (Chalkidis et al., 2019; 2021) dataset... We also evaluate RoBERTa and ModernBERT models with α-entmax attention on the GLUE benchmark (Wang et al., 2018)... we trained a small 124M GPT-2 model (Radford et al., 2019) from scratch on 10B tokens of the FineWeb dataset (Penedo et al., 2024)
Dataset Splits No The paper describes the datasets used (Fineweb-edu, BEIR, ECtHR, GLUE, FineWeb) and mentions context lengths (e.g., 128 tokens for GLUE, 1024 tokens for GPT-2) and training/validation usage for some, but it does not provide explicit split percentages, sample counts, or a specific methodology for dividing these datasets into training, validation, and test sets. It often refers to 'following the setup' of other papers or using 'default hyperparameters', which implies standard splits, but these are not explicitly detailed within the paper itself.
Hardware Specification Yes Experiments on masked language modeling, text classification, GLUE tasks, and BEIR tasks were carried out on Nvidia RTX A6000 GPUs with 48GB VRAM. Experiments with GPT-2 and the efficiency benchmark in Figures 1 and 3 were carried out on a single Nvidia H100 GPU (80GB). The runtime experiments with ModernBERT were carried out on a single A6000 GPU.
Software Dependencies No We used the Hugging Face Transformers library for model training and implementation and the Datasets library for data handling. The models were evaluated on BEIR tasks using the MTEB benchmark toolkit. We trained both the standard GPT-2 model and sparse GPT-2 (α = 1.5) using the configuration provided in the llm.c repository. The paper mentions these software components but does not provide specific version numbers for any of them (e.g., PyTorch version, Transformers version).
Experiment Setup Yes Concretely, we used a batch size of 32 and a learning rate of 5e-5, optimized with the AdamW optimizer. Training was conducted for 100,000 steps using mixed-precision (fp16). The sparsity parameter (α) was initialized at 1.01 and annealed linearly to a final value of 1.5 or 2.0 over 50,000 steps... We use an effective batch size of 512, with gradient accumulation to fit into available GPU memory. We use the AdamW optimizer, with learning rate 6e-4 and weight decay of 0.1. The learning rate followed a warm-up phase, linearly ramping from zero to a maximum of 6e-4 over the first 700 iterations, equivalent to 350 million tokens. Subsequently, the learning rate decayed to zero across the remaining training steps.
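The two schedules described above (learning-rate warmup-then-decay for GPT-2, and linear α annealing for the encoder runs) can be sketched as plain Python functions. Names and the linear shape of the post-warmup decay are our assumptions; the report states only that the learning rate "decayed to zero across the remaining training steps."

```python
def lr_at(step, total_steps, max_lr=6e-4, warmup=700):
    """GPT-2 schedule: linear warmup from zero over the first 700 iterations,
    then decay to zero over the remaining steps (decay shape assumed linear)."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * max(0.0, (total_steps - step) / (total_steps - warmup))

def alpha_at(step, alpha_final=1.5, anneal_steps=50_000, alpha_init=1.01):
    """Sparsity annealing: alpha starts at 1.01 and ramps linearly to its
    final value (1.5 or 2.0) over 50,000 steps, then stays fixed."""
    t = min(step / anneal_steps, 1.0)
    return alpha_init + t * (alpha_final - alpha_init)
```

Starting α near 1 makes the attention close to dense softmax early in training and only gradually introduces sparsity, which matches the annealing described in the setup.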