Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Authors: Jason Ramapuram, Federico Danieli, Eeshan Gunesh Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russell Webb

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve.
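As a minimal illustration of the mechanism the abstract describes, the sketch below replaces softmax's row normalization with an elementwise logistic sigmoid plus a constant bias. The `b = -ln(n)` default follows the paper's norm-stabilizing bias; the function and variable names, and the single-head NumPy formulation, are our own illustrative assumptions, not the FLASHSIGMOID kernel itself:

```python
import numpy as np

def sigmoid_attention(q, k, v, bias=None):
    """Sigmoid attention sketch (illustrative, not the paper's kernel).

    q, k, v: arrays of shape (n, d). Returns an (n, d) array.
    Unlike softmax attention, each score is squashed independently by a
    sigmoid; there is no row normalization across keys.
    """
    n, d = q.shape
    if bias is None:
        # Bias b = -ln(n) keeps the total attention mass per row roughly
        # comparable to softmax's (paper's normalization choice).
        bias = -np.log(n)
    scores = q @ k.T / np.sqrt(d) + bias
    weights = 1.0 / (1.0 + np.exp(-scores))  # elementwise sigmoid
    return weights @ v
```

With zero queries every score equals the bias, so each key receives weight sigmoid(-ln n) = 1/(1+n), which makes the normalization effect easy to check by hand.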
Researcher Affiliation | Industry | Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb (Apple). Correspondence to: Jason Ramapuram <EMAIL>
Pseudocode | Yes | Algorithm 1: FLASHSIGMOID Forward Pass; Algorithm 2: FLASHSIGMOID Backward Pass
Open Source Code | Yes | Code is available at https://github.com/apple/ml-sigmoid-attention.
Open Datasets | Yes | To empirically validate Sigmoid Attn, we evaluate across several domains: supervised image classification using vision transformers (Dosovitskiy et al., 2021), self-supervised image representation learning with SimCLR (Chen et al., 2020; Zhai et al., 2023a), Bootstrap Your Own Latent (BYOL) (Grill et al., 2020; Busbridge et al., 2023) and Masked AutoEncoders (MAE) (He et al., 2022), as well as automatic speech recognition (ASR) (Synnaeve et al., 2020; Gulati et al., 2020b) and auto-regressive language modeling (LM) (Brown et al., 2020). We also validate sequence length generalization on TED-LIUM v3 (Hernandez et al., 2018) for ASR and in small-scale synthetic experiments in App. G.5.4. Section 5.2 SUPERVISED IMAGE CLASSIFICATION: ImageNet dataset (Deng et al., 2009). Section 5.4 AUTOMATIC SPEECH RECOGNITION (ASR): LibriSpeech data (Panayotov et al., 2015). Section 5.5 AUTOREGRESSIVE LARGE LANGUAGE MODELING: RedPajama (Computer, 2023) dataset.
Dataset Splits | Yes | We train models until the greedy WER stops improving on the validation sets (dev-clean, dev-other) and report final test sets (test-clean, test-other) greedy WER without integration of any external language model. To perform evaluation on TED-LIUM v3, we combine together validation and test sets of TED-LIUM v3 (we don't use them for training and hyper-parameters search and just perform final evaluation) and split them into 4 datasets according to the duration: 0-10s, 10-20s, 20-30s, and 30s+.
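The duration-based evaluation split quoted above can be sketched as a small helper. The bucket names come from the excerpt; the exact boundary handling at 10, 20, and 30 seconds is our assumption, since the quote only names the ranges:

```python
def duration_bucket(seconds):
    """Assign an utterance to one of the four TED-LIUM v3 evaluation
    buckets used for sequence-length generalization: 0-10s, 10-20s,
    20-30s, 30s+. Boundary handling (half-open intervals) is assumed.
    """
    if seconds < 10:
        return "0-10s"
    if seconds < 20:
        return "10-20s"
    if seconds < 30:
        return "20-30s"
    return "30s+"
```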
Hardware Specification | Yes | We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. To measure the performance improvements of FLASHSIGMOID, we compare the timings of the kernels in its forward and backward passes against those of FLASHATTENTION2. The details of this benchmarking on H100 and A100 GPUs can be found in App. F.2.
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2019)", "JAX without FLASHATTENTION using the AXLearn framework", and "T5 tokenizer (Raffel et al., 2020)". While these software tools and frameworks are mentioned, specific version numbers required for reproducibility are not provided in the main text or appendices. For example, PyTorch is cited but its version is not specified.
Experiment Setup | Yes | We train a single layer AR transformer block (E=3072, D_FF=12288) on the realnews split of C4 (Raffel et al., 2020). We train for 2^16 steps using a batch size of 6 and max sequence length of 4096 using a single cycle cosine learning rate (LR) schedule without weight decay. All models are trained with RoPE with b = -ln n, using AdamW (Loshchilov & Hutter, 2017) on the realnews split of C4 with (β1, β2) = (0.9, 0.95), ϵ = 10^-8, wd = 0, batch size 24, maximum token sequence length of 512 from the T5 tokenizer (Raffel et al., 2020), cosine LR schedule of 2^14 steps including a linear warmup of 2^10 steps. Appendix G.2.4: Tables 7 and 8 list detailed hyperparameters for vision models (SimCLR, BYOL, Supervised ViT, MAE ViT). Appendix G.3.1: Table 9 lists training details for Language Model (1B and 7B). Appendix G.4.1: Table 11 lists training details for ASR models.
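The LM hyperparameters quoted above can be gathered into a single configuration sketch. The dataclass and field names are ours, and we assume the quoted step counts "214" and "210" are the paper's 2^14 and 2^10 with lost superscripts:

```python
from dataclasses import dataclass

@dataclass
class LMTrainConfig:
    """Illustrative container for the quoted AdamW LM training setup."""
    beta1: float = 0.9            # AdamW (β1, β2) = (0.9, 0.95)
    beta2: float = 0.95
    eps: float = 1e-8             # ϵ = 10^-8
    weight_decay: float = 0.0     # wd = 0
    batch_size: int = 24
    max_seq_len: int = 512        # T5-tokenizer tokens
    total_steps: int = 2 ** 14    # cosine LR schedule length
    warmup_steps: int = 2 ** 10   # linear warmup

cfg = LMTrainConfig()
```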