FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

Authors: Vincent Abbott, Gioele Zardini

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This work uses a theoretical framework to link assumptions about GPU behaviour to claims about performance. We aim to lay the groundwork for a scientific approach to GPU optimization where experiments can address clear hypotheses rather than post-hoc rationalizations.
Researcher Affiliation | Academia | Vincent Abbott (EMAIL), Department of Computer Science, University College London; Gioele Zardini (EMAIL), Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
Pseudocode | Yes | In this section, we provide a systematic procedure to go from an abstract two-level model to an algorithm configured for a specific GPU architecture. We work with a toy hierarchy which imitates Hopper, with levels for the different modes in which memory can be stored: SMEM, registers, or tensor cores. Available subalgorithms are dictated by the functionalities of the respective levels. The aim of this section is not to focus on Hopper specifically, as many aspects of implementation will be missed, but to show how diagrams can be used to systematically derive a hardware-aware algorithm. This methodology can be extended to Ampere, Blackwell, and non-NVIDIA architectures. (...) 5.1 From Diagrams to Pseudocode: We can expand streamed algorithms into looped pseudocode forms where all variables are explicitly shown, as in Figure 27. The columns of pseudocode diagrams provide the sizes of the variables required in memory and the transfers/operations we need to apply. This allows us to pre-allocate memory and derive exact memory usage, as well as per-group transfer and compute costs.
Open Source Code | Yes | The performance analysis is assisted by an Excel spreadsheet we developed, available at github.com/mitzardini-lab/Napkin. In the future, additional tools will be provided at that repository.
Open Datasets | No | The paper presents a theoretical framework and methodology for algorithm optimization, built on diagrammatic representation and performance modeling. It conducts no experiments on specific datasets, so no open datasets are mentioned or provided.
Dataset Splits | No | The paper introduces a theoretical framework for algorithm optimization and performance modeling and does not involve experimental evaluation on datasets; therefore, no dataset splits are provided.
Hardware Specification | No | The paper discusses various GPU architectures (e.g., A100s, H100 SXM5s, a Hopper-like architecture, Blackwell) as examples or targets for its theoretical methodology, stating: "With a model corresponding to a Hopper-like architecture, we use a step-by-step process to derive hardware-aware algorithms (Section 5). The aim of this section is to not focus on Hopper specifically, as many aspects of implementation will be missed." The hardware is thus discussed for theoretical derivation and modeling, not as specific equipment used to run experiments for the paper.
Software Dependencies | No | The paper discusses existing deep learning frameworks and tools such as PyTorch and Triton in its related work and background, but it does not specify software dependencies with version numbers. The work is theoretical and involves no experimental setup that would require such details.
Experiment Setup | No | The paper focuses on developing a theoretical framework and methodology for deriving hardware-aware algorithms and performance models. It describes no empirical experiments and therefore provides no details of experimental setup, hyperparameters, or training configurations.
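The "streamed algorithm" and per-group transfer-cost accounting described in the Pseudocode row can be made concrete with a minimal sketch. This is our own illustrative code, not the paper's: it implements a FlashAttention-style tiled attention pass with an online softmax, keeping the query block resident while streaming K/V tiles and explicitly counting the elements moved per tile (all function and variable names here are assumptions for illustration).

```python
# Illustrative sketch only: a streamed, IO-counting attention loop in the
# spirit of the paper's pseudocode diagrams. Tile size and names are
# hypothetical; real kernels would also model SMEM/register residency.
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full N x N score matrix."""
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def streamed_attention(Q, K, V, tile=4):
    """Tiled attention with online softmax; Q stays resident, K/V stream in.

    Returns the output and the number of elements transferred for K/V tiles,
    mimicking the per-group transfer costs read off a pseudocode diagram.
    """
    N, d = Q.shape
    O = np.zeros_like(Q, dtype=float)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    transfers = 0             # elements streamed from global memory
    for j in range(0, N, tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        transfers += Kj.size + Vj.size
        S = Q @ Kj.T                          # scores for this K/V tile
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)             # rescale previous partials
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None], transfers
```

The point of the sketch is the one made by the paper's pseudocode diagrams: memory for `O`, `m`, and `l` is pre-allocated up front, and each loop iteration has a fixed, countable transfer cost, so total IO follows directly from the loop structure rather than from profiling after the fact.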