Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Authors: Jason Ramapuram, Federico Danieli, Eeshan Gunesh Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russell Webb

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve.
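As a minimal illustration of the mechanism the abstract describes, the sketch below replaces softmax's row normalization with an elementwise logistic sigmoid plus a constant bias. The `b = -ln(n)` default follows the paper's norm-stabilizing bias; the function and variable names, and the single-head NumPy formulation, are our own illustrative assumptions, not the FLASHSIGMOID kernel itself:

```python
import numpy as np

def sigmoid_attention(q, k, v, bias=None):
    """Sigmoid attention sketch (illustrative, not the paper's kernel).

    q, k, v: arrays of shape (n, d). Returns an (n, d) array.
    Unlike softmax attention, each score is squashed independently by a
    sigmoid; there is no row normalization across keys.
    """
    n, d = q.shape
    if bias is None:
        # Bias b = -ln(n) keeps the total attention mass per row roughly
        # comparable to softmax's (paper's normalization choice).
        bias = -np.log(n)
    scores = q @ k.T / np.sqrt(d) + bias
    weights = 1.0 / (1.0 + np.exp(-scores))  # elementwise sigmoid
    return weights @ v
```

With zero queries every score equals the bias, so each key receives weight sigmoid(-ln n) = 1/(1+n), which makes the normalization effect easy to check by hand.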
Researcher Affiliation | Industry | Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb (Apple). Correspondence to: Jason Ramapuram <EMAIL>
Pseudocode | Yes | Algorithm 1: FLASHSIGMOID Forward Pass; Algorithm 2: FLASHSIGMOID Backward Pass
Open Source Code | Yes | Code is available at https://github.com/apple/ml-sigmoid-attention.
Open Datasets | Yes | To empirically validate Sigmoid Attn, we evaluate across several domains: supervised image classification using vision transformers (Dosovitskiy et al., 2021), self-supervised image representation learning with SimCLR (Chen et al., 2020; Zhai et al., 2023a), Bootstrap Your Own Latent (BYOL) (Grill et al., 2020; Busbridge et al., 2023) and Masked AutoEncoders (MAE) (He et al., 2022), as well as automatic speech recognition (ASR) (Synnaeve et al., 2020; Gulati et al., 2020b) and auto-regressive language modeling (LM) (Brown et al., 2020). We also validate sequence length generalization on TED-LIUM v3 (Hernandez et al., 2018) for ASR and in small-scale synthetic experiments in App. G.5.4. Section 5.2 SUPERVISED IMAGE CLASSIFICATION: ImageNet dataset (Deng et al., 2009). Section 5.4 AUTOMATIC SPEECH RECOGNITION (ASR): LibriSpeech data (Panayotov et al., 2015). Section 5.5 AUTOREGRESSIVE LARGE LANGUAGE MODELING: RedPajama (Computer, 2023) dataset.
Dataset Splits | Yes | We train models until the greedy WER stops improving on the validation sets (dev-clean, dev-other) and report final test sets (test-clean, test-other) greedy WER without integration of any external language model. To perform evaluation on TED-LIUM v3, we combine together validation and test sets of TED-LIUM v3 (we don't use them for training and hyper-parameters search and just perform final evaluation) and split them into 4 datasets according to the duration: 0-10s, 10-20s, 20-30s, and 30s+.
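The duration-based evaluation split quoted above can be sketched as a small helper. The bucket names come from the excerpt; the exact boundary handling at 10, 20, and 30 seconds is our assumption, since the quote only names the ranges:

```python
def duration_bucket(seconds):
    """Assign an utterance to one of the four TED-LIUM v3 evaluation
    buckets used for sequence-length generalization: 0-10s, 10-20s,
    20-30s, 30s+. Boundary handling (half-open intervals) is assumed.
    """
    if seconds < 10:
        return "0-10s"
    if seconds < 20:
        return "10-20s"
    if seconds < 30:
        return "20-30s"
    return "30s+"
```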
Hardware Specification | Yes | We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. To measure the performance improvements of FLASHSIGMOID, we compare the timings of the kernels in its forward and backward passes against those of FLASHATTENTION2. The details of this benchmarking on H100 and A100 GPUs can be found in App. F.2.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)", "JAX without FLASHATTENTION using the AXLearn framework", and "T5 tokenizer (Raffel et al., 2020)". While these software tools and frameworks are mentioned, specific version numbers required for reproducibility are not provided in the main text or appendices. For example, PyTorch is cited but its version is not specified.
Experiment Setup | Yes | We train a single layer AR transformer block (E=3072, D_FF=12288) on the realnews split of C4 (Raffel et al., 2020). We train for 2^16 steps using a batch size of 6 and max sequence length of 4096 using a single cycle cosine learning rate (LR) schedule without weight decay. All models are trained with RoPE with b = -ln n, using AdamW (Loshchilov & Hutter, 2017) on the realnews split of C4 with (β1, β2) = (0.9, 0.95), ϵ = 10^-8, wd = 0, batch size 24, maximum token sequence length of 512 from the T5 tokenizer (Raffel et al., 2020), cosine LR schedule of 2^14 steps including a linear warmup of 2^10 steps. Appendix G.2.4: Tables 7 and 8 list detailed hyperparameters for vision models (SimCLR, BYOL, Supervised ViT, MAE ViT). Appendix G.3.1: Table 9 lists training details for Language Model (1B and 7B). Appendix G.4.1: Table 11 lists training details for ASR models.
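The LM hyperparameters quoted above can be gathered into a single configuration sketch. The dataclass and field names are ours, and we assume the quoted step counts "214" and "210" are the paper's 2^14 and 2^10 with lost superscripts:

```python
from dataclasses import dataclass

@dataclass
class LMTrainConfig:
    """Illustrative container for the quoted AdamW LM training setup."""
    beta1: float = 0.9            # AdamW (β1, β2) = (0.9, 0.95)
    beta2: float = 0.95
    eps: float = 1e-8             # ϵ = 10^-8
    weight_decay: float = 0.0     # wd = 0
    batch_size: int = 24
    max_seq_len: int = 512        # T5-tokenizer tokens
    total_steps: int = 2 ** 14    # cosine LR schedule length
    warmup_steps: int = 2 ** 10   # linear warmup

cfg = LMTrainConfig()
```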