ThunderKittens: Simple, Fast, and $\textit{Adorable}$ Kernels
Authors: Benjamin Spector, Simran Arora, Aaryan Singhal, Arjun Parthasarathy, Dan Fu, Christopher Ré
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we validate that THUNDERKITTENS speeds up a broad range of ML primitives. We compare to well-optimized kernels from prior work, written in alternate frameworks such as CUTLASS, cuBLAS, general CUDA, and Triton. We compare our kernels for the workhorse operations in AI, GEMM and attention, as well as kernels for emerging AI architectures, such as linear attention and state space models (Section 4.1). We profile the kernels to understand TK's role in achieving high performance in Section 4.2. We benchmark on an NVIDIA H100 80GB SXM GPU using CUDA 12.6 and report average TFLOPS. |
| Researcher Affiliation | Collaboration | 1 Stanford University, 2 Columbia University, 3 UCSD, 4 Together AI EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | C THUNDERKITTENS KERNEL LISTINGS This section first recaps our benchmarking methodology for the results and provides a set of kernels written in the TK LCSF template and tile abstractions: 1. Appendix C.1 GEMM kernel 2. Appendix C.2 Long convolution kernel 3. Appendix C.3 Attention kernel 4. Appendix C.4 Rotary kernel |
| Open Source Code | Yes | Our kernels in Section 4 are benchmarked on an NVIDIA H100 80GB SXM GPU with 10 warmup and 10 timed iterations using timings measured in C++. We also provide Python-bound kernels and benchmarking infrastructure in our repository for reference. |
| Open Datasets | No | The paper focuses on the performance of AI kernels for operations like GEMM, attention, linear attention, and state space models, rather than experiments on specific public datasets. No datasets are mentioned with concrete access information for public availability. |
| Dataset Splits | No | The paper evaluates the performance of AI kernels for operations (e.g., GEMM, attention) using varying input dimensions (e.g., sequence length, head dimension, batch size) rather than training or evaluating models on traditional datasets. Therefore, there are no dataset splits (training/validation/test) mentioned. |
| Hardware Specification | Yes | We benchmark on an NVIDIA H100 80GB SXM GPU using CUDA 12.6 and report average TFLOPS. We provide experiments on an NVIDIA RTX 4090 and an Apple M2 Pro in Appendix B. |
| Software Dependencies | Yes | We use the following software versions for benchmarking: CUDA 12.6, Triton version 3.0, and PyTorch version 2.4. |
| Experiment Setup | Yes | In order to ensure fair performance comparisons between TK kernels and others, we run 10 warm-up iterations, then use CUDA events to measure total kernel execution time over 10 benchmarking iterations. Reported performance is the average of the 10 benchmarking iterations. Baseline GEMM kernels are tuned via a grid search over the default execution parameters exposed (if any) and through auto-tuning methods (exposed via cuBLASLt); the maximum performance achieved is reported. Furthermore, for Triton kernels, we run triton.autotune over the default parameter configurations provided in the baselines we compare TK kernels to. In order to avoid impacting performance measurements, kernel tuning is done in separate iterations prior to warmup and benchmarking. |
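The benchmarking protocol quoted above (10 untimed warm-up runs, then the average over 10 timed runs, reported as TFLOPS) can be sketched as follows. This is a minimal illustration, not the paper's harness: the paper times GPU kernels with CUDA events in C++, whereas this stand-in uses wall-clock time and a toy pure-Python GEMM so it runs anywhere; the function name `benchmark_tflops` and the toy workload are our own.

```python
import time

def benchmark_tflops(fn, flops, warmup=10, iters=10):
    """Run `fn` following the paper's protocol: `warmup` untimed
    warm-up iterations, then `iters` timed iterations, returning the
    average achieved TFLOPS (flops / mean elapsed seconds / 1e12).
    Stand-in for the paper's CUDA-event timing in C++."""
    for _ in range(warmup):
        fn()
    elapsed = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - t0)
    avg_s = sum(elapsed) / len(elapsed)
    return flops / avg_s / 1e12

# Toy workload: an M x K x N GEMM costs 2*M*N*K FLOPs
# (one multiply and one add per inner-product term).
M = N = K = 32
A = [[1.0] * K for _ in range(M)]
B = [[1.0] * N for _ in range(K)]

def gemm():
    return [[sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(N)] for i in range(M)]

tflops = benchmark_tflops(gemm, flops=2 * M * N * K)
```

Tuning (e.g. the grid search over cuBLASLt parameters or `triton.autotune`) would happen in separate iterations before `benchmark_tflops` is called, so autotuning overhead never leaks into the timed runs.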