SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across seven recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. |
| Researcher Affiliation | Collaboration | ¹Independent, ²Decode Research, ³University College London, ⁴Cambridge Consultants, ⁵MATS Research, ⁶Anthropic. Correspondence to: Adam Karvonen <EMAIL>, Can Rager <EMAIL>. |
| Pseudocode | No | The paper describes methods and objectives in detail, but does not present any explicitly labeled pseudocode or algorithm blocks. Equations are used to formalize objectives, but these are not structured algorithms. |
| Open Source Code | Yes | Code and models available at: github.com/adamkarvonen/SAEBench |
| Open Datasets | Yes | Dataset: The Pile |
| Dataset Splits | Yes | For each dataset class, we structure the task as a one-versus-all binary classification task... We sample 4,000 training and 1,000 test examples per binary classification task and truncate all inputs to 128 tokens. |
| Hardware Specification | Yes | The computational requirements for running SAEBench evaluations were measured on an NVIDIA RTX 3090 GPU using 16K width SAEs trained on the Gemma-2-2B model. |
| Software Dependencies | No | The paper mentions training SAEs using the open source library dictionary learning (Marks et al., 2024b) and using gpt4o-mini as an LLM judge, but specific version numbers for these or other software dependencies are not provided. |
| Experiment Setup | Yes | Tokens processed: 500M; Learning rate: 3×10⁻⁴; Learning rate warmup (from 0): 1,000 steps; Sparsity penalty warmup (from 0): 5,000 steps; Learning rate decay (to 0): last 20% of training; Dataset: The Pile; Batch size: 2,048; LLM context length: 1,024 |
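To make the quoted evaluation setup concrete, the sketch below mimics the one-versus-all binary classification task described in the Dataset Splits row (4,000 training and 1,000 test examples per task) by training a linear probe on stand-in activations. This is a minimal illustration under our own assumptions, not code from the SAEBench repository: the probe trainer, the synthetic "SAE latent" features, and all names here are hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained with plain full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples on the correct side of the decision boundary."""
    return np.mean(((X @ w + b) > 0) == y)

# Synthetic stand-in for SAE latent activations: the positive class
# activates latent 0, all other latents are noise.
rng = np.random.default_rng(42)
n_train, n_test, d = 4000, 1000, 16  # matches the 4,000 / 1,000 split above
y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)
X_train = rng.normal(0.0, 0.1, (n_train, d))
X_train[:, 0] += y_train
X_test = rng.normal(0.0, 0.1, (n_test, d))
X_test[:, 0] += y_test

w, b = train_linear_probe(X_train, y_train.astype(float))
acc = probe_accuracy(X_test, y_test, w, b)
```

Because the synthetic classes are cleanly separated along one latent, the probe recovers near-perfect test accuracy; in the actual benchmark, probe accuracy on real SAE latents is what distinguishes architectures.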