SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Authors: Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across seven recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. |
| Researcher Affiliation | Collaboration | ¹Independent, ²Decode Research, ³University College London, ⁴Cambridge Consultants, ⁵MATS Research, ⁶Anthropic. Correspondence to: Adam Karvonen <EMAIL>, Can Rager <EMAIL>. |
| Pseudocode | No | The paper describes methods and objectives in detail, but does not present any explicitly labeled pseudocode or algorithm blocks. Equations are used to formalize objectives, but these are not structured algorithms. |
| Open Source Code | Yes | Code and models available at: github.com/adamkarvonen/SAEBench |
| Open Datasets | Yes | Dataset: The Pile |
| Dataset Splits | Yes | For each dataset class, we structure the task as a one-versus-all binary classification task... We sample 4,000 training and 1,000 test examples per binary classification task and truncate all inputs to 128 tokens. |
| Hardware Specification | Yes | The computational requirements for running SAEBench evaluations were measured on an NVIDIA RTX 3090 GPU using 16K width SAEs trained on the Gemma-2-2B model. |
| Software Dependencies | No | The paper mentions training SAEs using the open source library dictionary learning (Marks et al., 2024b) and using gpt4o-mini as an LLM judge, but specific version numbers for these or other software dependencies are not provided. |
| Experiment Setup | Yes | Tokens processed: 500M; Learning rate: 3×10⁻⁴; Learning rate warmup (from 0): 1,000 steps; Sparsity penalty warmup (from 0): 5,000 steps; Learning rate decay (to 0): last 20% of training; Dataset: The Pile; Batch size: 2,048; LLM context length: 1,024 |
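To make the quoted evaluation setup concrete, the sketch below mimics the one-versus-all binary classification task described in the Dataset Splits row (4,000 training and 1,000 test examples per task) by training a linear probe on stand-in activations. This is a minimal illustration under our own assumptions, not code from the SAEBench repository: the probe trainer, the synthetic "SAE latent" features, and all names here are hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained with plain full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples on the correct side of the decision boundary."""
    return np.mean(((X @ w + b) > 0) == y)

# Synthetic stand-in for SAE latent activations: the positive class
# activates latent 0, all other latents are noise.
rng = np.random.default_rng(42)
n_train, n_test, d = 4000, 1000, 16  # matches the 4,000 / 1,000 split above
y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)
X_train = rng.normal(0.0, 0.1, (n_train, d))
X_train[:, 0] += y_train
X_test = rng.normal(0.0, 0.1, (n_test, d))
X_test[:, 0] += y_test

w, b = train_linear_probe(X_train, y_train.astype(float))
acc = probe_accuracy(X_test, y_test, w, b)
```

Because the synthetic classes are cleanly separated along one latent, the probe recovers near-perfect test accuracy; in the actual benchmark, probe accuracy on real SAE latents is what distinguishes architectures.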