KernelBench: Can LLMs Write Efficient GPU Kernels?
Authors: Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Re, Azalia Mirhoseini
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Stanford University, Stanford, California, USA; ²Department of Computer Science, Princeton University, Princeton, New Jersey, USA. |
| Pseudocode | No | The paper provides code examples of GPU kernels (Appendix A) and describes high-level workflows (Figure 1), but does not contain explicitly labeled pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. ... While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p. ... To address these challenges, we contribute (1) an open-source framework to study LM kernel generation with a comprehensive suite of evaluation problems and (2) analysis of where current LMs stand and how to realize a future of efficient kernels generated by models. |
| Open Datasets | Yes | We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. ... KernelBench instead curates a set of 250 diverse kernels from real-world, modern deep learning workloads, many of which do not have existing human-written implementations – in other words, solving KernelBench tasks is immediately beneficial for real deep learning workloads. ... The 250 tasks in KernelBench are partitioned into three levels, based on the number of primitive operations, or PyTorch library functions, they contain: |
| Dataset Splits | Yes | Given a task, we randomly generate input tensors of the prescribed shape and precision and collect the PyTorch Model output. We can evaluate whether LM generations are correct and fast as follows: 1. Correctness We compare the Model output to the LM-generated ModelNew output. We evaluate on 5 random inputs per problem (detailed in Appendix B). ... We use five sets of random inputs for correctness, which is a good tradeoff between the ability to catch errors and efficiency. |
| Hardware Specification | Yes | All evaluations are conducted on a bare-metal NVIDIA L40S GPU with Ada Lovelace architecture unless otherwise stated (such as the device generalization experiments in Section 4.4 and the hardware case study in 5.2). The NVIDIA L40S has 48 GB of memory and operates at 300W. ... Table 13 (GPU specifications: memory, power, microarchitecture, FP16 TFLOPS, memory bandwidth, provider): bare-metal NVIDIA L40S (48 GB, 300W, Ada, 362.05 TFLOPS, 864 GB/s); bare-metal NVIDIA H100 (80 GB, 700W, Hopper, 989.5 TFLOPS, 3350 GB/s); serverless NVIDIA A100 (42 GB, 400W, Ampere, 312 TFLOPS, 1935 GB/s); serverless NVIDIA L4 (24 GB, 72W, Ada, 121 TFLOPS, 300 GB/s); serverless NVIDIA T4 (16 GB, 70W, Turing, 65 TFLOPS, 300 GB/s); serverless NVIDIA A10G (24 GB, 300W, Ampere, 125 TFLOPS, 600 GB/s). |
| Software Dependencies | Yes | Our environment uses Python 3.10, PyTorch 2.5.0+cu124, and CUDA 12.4, which is also where our PyTorch Eager and torch.compile baselines are derived from. |
| Experiment Setup | Yes | We sample the model with greedy decoding to ensure deterministic output, which is setting temperature = 0. ... Specifically, we use temperature = 1.6 for DeepSeek-V3 and temperature = 0.7 for Llama 3.1-70B. ... A limitation of our experiments is that we sample with temperature = 0 to focus on the effect of iterating based on feedback rather than introducing variability. |
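The fast_p metric quoted in the Research Type row can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation; the function name and the `(correct, speedup)` pair representation are assumptions for the sketch:

```python
def fast_p(results, p):
    """Fraction of generated kernels that are both functionally correct
    and more than p times faster than the baseline.

    `results` is a list of (correct, speedup) pairs, where
    speedup = baseline_time / generated_kernel_time.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Three kernels: one correct and 1.5x faster, one correct but slower,
# one incorrect. Only the first counts toward fast_1.
score = fast_p([(True, 1.5), (True, 0.8), (False, 2.0)], p=1.0)
```

At p = 0, fast_p reduces to the plain correctness rate; raising p makes the benchmark strictly harder, which matches the paper's observation that difficulty increases with the speedup threshold.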
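The correctness protocol quoted in the Dataset Splits row (comparing the reference Model output against the LM-generated ModelNew output on 5 sets of random inputs) can be sketched as follows. This is a simplified NumPy stand-in for the real harness, which runs PyTorch modules on GPU; the function names and tolerances here are assumptions:

```python
import numpy as np

def check_correctness(ref_fn, gen_fn, make_inputs,
                      num_trials=5, rtol=1e-4, atol=1e-4):
    """Run both implementations on `num_trials` randomly generated
    input sets (KernelBench uses 5) and compare outputs elementwise."""
    for seed in range(num_trials):
        rng = np.random.default_rng(seed)  # fresh random inputs per trial
        inputs = make_inputs(rng)
        if not np.allclose(ref_fn(*inputs), gen_fn(*inputs),
                           rtol=rtol, atol=atol):
            return False  # any mismatch on any trial fails the task
    return True

# Example: a "generated" matmul checked against the reference.
make_inputs = lambda rng: (rng.standard_normal((8, 8)),
                           rng.standard_normal((8, 8)))
ok = check_correctness(np.matmul, lambda a, b: a @ b, make_inputs)
```

Using multiple random input sets rather than one is what the paper calls a tradeoff between catching errors (e.g. kernels that only work for specific values) and evaluation cost.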