FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

Authors: Vincent Abbott, Gioele Zardini

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This work uses a theoretical framework to link assumptions about GPU behaviour to claims about performance. We aim to lay the groundwork for a scientific approach to GPU optimization where experiments can address clear hypotheses rather than post-hoc rationalizations.
Researcher Affiliation | Academia | Vincent Abbott (EMAIL), Department of Computer Science, University College London; Gioele Zardini (EMAIL), Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
Pseudocode | Yes | In this section, we provide a systematic procedure to go from an abstract two-level model to an algorithm configured for a specific GPU architecture. We work with a toy hierarchy which imitates Hopper, with levels for the different modes in which memory can be stored: SMEM, registers, or tensor cores. Available subalgorithms are dictated by the functionalities of the respective levels. The aim of this section is not to focus on Hopper specifically, as many aspects of implementation will be missed, but to show how diagrams can be used to systematically derive a hardware-aware algorithm. This methodology can be extended to Ampere, Blackwell, and non-NVIDIA architectures. (...) 5.1 From Diagrams to Pseudocode: We can expand streamed algorithms into looped pseudocode forms where all variables are explicitly shown, as in Figure 27. The columns of pseudocode diagrams provide the sizes of the variables required in memory and the transfers/operations we need to apply. This allows us to pre-allocate memory and derive exact memory usage, as well as per-group transfer and compute costs.
Open Source Code | Yes | The performance analysis is assisted by an Excel spreadsheet we developed, available at github.com/mitzardini-lab/Napkin. In the future, additional tools will be provided at that repository.
Open Datasets | No | The paper presents a theoretical framework and methodology for algorithm optimization, built on diagrammatic representation and performance modeling. It conducts no experiments on specific datasets, so no open datasets are mentioned or provided.
Dataset Splits | No | The paper introduces a theoretical framework for algorithm optimization and performance modeling and does not involve experimental evaluation on datasets; therefore, no dataset splits are provided.
Hardware Specification | No | The paper discusses various GPU architectures (e.g., A100s, H100 SXM5s, a Hopper-like architecture, Blackwell) as examples or targets for its theoretical methodology, stating: "With a model corresponding to a Hopper-like architecture, we use a step-by-step process to derive hardware-aware algorithms (Section 5). The aim of this section is to not focus on Hopper specifically, as many aspects of implementation will be missed." The hardware is thus discussed for theoretical derivation and modeling, not as specific equipment used to run experiments for the paper.
Software Dependencies | No | The paper discusses existing deep learning frameworks and tools such as PyTorch and Triton in its related work and background, but it does not specify software dependencies with version numbers. The work is theoretical and involves no experimental setup that would require such details.
Experiment Setup | No | The paper focuses on developing a theoretical framework and methodology for deriving hardware-aware algorithms and performance models. It describes no empirical experiments and therefore provides no details of experimental setup, hyperparameters, or training configurations.
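The "streamed algorithm" and per-group transfer-cost accounting described in the Pseudocode row can be made concrete with a minimal sketch. This is our own illustrative code, not the paper's: it implements a FlashAttention-style tiled attention pass with an online softmax, keeping the query block resident while streaming K/V tiles and explicitly counting the elements moved per tile (all function and variable names here are assumptions for illustration).

```python
# Illustrative sketch only: a streamed, IO-counting attention loop in the
# spirit of the paper's pseudocode diagrams. Tile size and names are
# hypothetical; real kernels would also model SMEM/register residency.
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full N x N score matrix."""
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def streamed_attention(Q, K, V, tile=4):
    """Tiled attention with online softmax; Q stays resident, K/V stream in.

    Returns the output and the number of elements transferred for K/V tiles,
    mimicking the per-group transfer costs read off a pseudocode diagram.
    """
    N, d = Q.shape
    O = np.zeros_like(Q, dtype=float)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    transfers = 0             # elements streamed from global memory
    for j in range(0, N, tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        transfers += Kj.size + Vj.size
        S = Q @ Kj.T                          # scores for this K/V tile
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)             # rescale previous partials
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None], transfers
```

The point of the sketch is the one made by the paper's pseudocode diagrams: memory for `O`, `m`, and `l` is pre-allocated up front, and each loop iteration has a fixed, countable transfer cost, so total IO follows directly from the loop structure rather than from profiling after the fact.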