ThunderKittens: Simple, Fast, and $\textit{Adorable}$ Kernels
Authors: Benjamin Spector, Simran Arora, Aaryan Singhal, Arjun Parthasarathy, Dan Fu, Christopher Ré
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we validate that THUNDERKITTENS speeds up a broad range of ML primitives. We compare to well-optimized kernels from prior work, written in alternate frameworks such as CUTLASS, cuBLAS, general CUDA, and Triton. We compare our kernels for the workhorse operations in AI, GEMM and attention, as well as kernels for emerging AI architectures, such as linear attention and state space models (Section 4.1). We profile the kernels to understand TK's role in achieving high performance in Section 4.2. We benchmark on an NVIDIA H100 80GB SXM GPU using CUDA 12.6 and report average TFLOPS. |
| Researcher Affiliation | Collaboration | 1 Stanford University, 2 Columbia University, 3 UCSD, 4 Together AI EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | C THUNDERKITTENS KERNEL LISTINGS This section first recaps our benchmarking methodology for the results and provides a set of kernels written in the TK LCSF template and tile abstractions: 1. Appendix C.1 GEMM kernel 2. Appendix C.2 Long convolution kernel 3. Appendix C.3 Attention kernel 4. Appendix C.4 Rotary kernel |
| Open Source Code | Yes | Our kernels in Section 4 are benchmarked on an NVIDIA H100 80GB SXM GPU with 10 warmup and 10 timed iterations using timings measured in C++. We also provide Python-bound kernels and benchmarking infrastructure in our repository for reference. |
| Open Datasets | No | The paper focuses on the performance of AI kernels for operations like GEMM, attention, linear attention, and state space models, rather than experiments on specific public datasets. No datasets are mentioned with concrete access information for public availability. |
| Dataset Splits | No | The paper evaluates the performance of AI kernels for operations (e.g., GEMM, attention) using varying input dimensions (e.g., sequence length, head dimension, batch size) rather than training or evaluating models on traditional datasets. Therefore, there are no dataset splits (training/validation/test) mentioned. |
| Hardware Specification | Yes | We benchmark on an NVIDIA H100 80GB SXM GPU using CUDA 12.6 and report average TFLOPS. We provide experiments on an NVIDIA RTX 4090 and an Apple M2 Pro in Appendix B. |
| Software Dependencies | Yes | We use the following software versions for benchmarking: CUDA 12.6, Triton version 3.0, and PyTorch version 2.4. |
| Experiment Setup | Yes | In order to ensure fair performance comparisons between TK kernels and others, we run 10 warm-up iterations, then use CUDA events to measure total kernel execution time over 10 benchmarking iterations. Reported performance is the average of the 10 benchmarking iterations. Baseline GEMM kernels are tuned via a grid search over the default execution parameters exposed (if any) and through auto-tuning methods (exposed via cuBLASLt); the maximum performance achieved is reported. Furthermore, for Triton kernels, we run triton.autotune over the default parameter configurations provided in the baselines we compare TK kernels to. In order to avoid impacting performance measurements, kernel tuning is done in separate iterations prior to warmup and benchmarking. |
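The benchmarking protocol quoted above (10 untimed warm-up runs, then the average over 10 timed runs, reported as TFLOPS) can be sketched as follows. This is a minimal illustration, not the paper's harness: the paper times GPU kernels with CUDA events in C++, whereas this stand-in uses wall-clock time and a toy pure-Python GEMM so it runs anywhere; the function name `benchmark_tflops` and the toy workload are our own.

```python
import time

def benchmark_tflops(fn, flops, warmup=10, iters=10):
    """Run `fn` following the paper's protocol: `warmup` untimed
    warm-up iterations, then `iters` timed iterations, returning the
    average achieved TFLOPS (flops / mean elapsed seconds / 1e12).
    Stand-in for the paper's CUDA-event timing in C++."""
    for _ in range(warmup):
        fn()
    elapsed = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - t0)
    avg_s = sum(elapsed) / len(elapsed)
    return flops / avg_s / 1e12

# Toy workload: an M x K x N GEMM costs 2*M*N*K FLOPs
# (one multiply and one add per inner-product term).
M = N = K = 32
A = [[1.0] * K for _ in range(M)]
B = [[1.0] * N for _ in range(K)]

def gemm():
    return [[sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(N)] for i in range(M)]

tflops = benchmark_tflops(gemm, flops=2 * M * N * K)
```

Tuning (e.g. the grid search over cuBLASLt parameters or `triton.autotune`) would happen in separate iterations before `benchmark_tflops` is called, so autotuning overhead never leaks into the timed runs.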