Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Authors: Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to ensemble SAEs and baselines to consistently improve over just baseline methods.
Researcher Affiliation | Academia | Massachusetts Institute of Technology. Correspondence to: Subhash Kantamneni <EMAIL>, Joshua Engels <EMAIL>.
Pseudocode | No | The paper describes methods through narrative text and mathematical equations (e.g., Equation 1), but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain an explicit statement that the code for the methodology described in this paper is released, nor does it provide a link to a code repository.
Open Datasets | Yes | We collect a diverse set of 113 binary classification datasets listed in Table 4 (Appendix C). Table 4 explicitly lists dataset names and their corresponding citations, many of which refer to publicly available sources or well-known benchmarks, such as 'Gurnee & Tegmark (2024)' and 'AI, T. and Ishii, D. Spam Text Message Classification kaggle.com. https://www.kaggle.com/datasets/team-ai/spam-text-message-classification.'
Dataset Splits | Yes | Often, a probe p has hyperparameters hp we would like to optimize. We select hp that has the maximal validation AUC using the cross-validation strategy described in Table 5. We then test p with optimal hp on a held-out test set to calculate AUC^test_p. All datasets have at least 100 testing examples, with most having more (the average test set size is 1945). Table 5 provides specific selection methods for hyperparameter tuning based on data size, including 'Use 80%/20% training/validation split' for n > 128.
Hardware Specification | No | The paper mentions using specific language models like Gemma-2-9B and Llama-3.1-8B and discusses training Sparse Autoencoders (SAEs), but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions various models and tools used (e.g., Gemma Scope, Claude-3.5-Sonnet, GPT-4o), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries) that would be needed to replicate the experimental setup.
Experiment Setup | Yes | We use 5 baseline probing methods, detailed with their respective hyperparameters in Table 2. Appendix D.3 'Probing Method Hyperparameter Details' provides specific ranges and values for hyperparameters for Logistic Regression, PCA Regression, K-Nearest Neighbors (KNN), XGBoost (e.g., 'n_estimators: Ranges from 50 to 250 in steps of 50'), and Multilayer Perceptron (MLP) (e.g., 'Network depth: 1 to 3 hidden layers', 'learning_rate_init: Five values ranging logarithmically from 10^-4 to 10^-2').
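The split protocol quoted under Dataset Splits can be sketched end to end: for n > 128, Table 5 prescribes an 80%/20% training/validation split, the hyperparameter with maximal validation AUC is selected, and AUC^test_p is then computed on a held-out test set. The sketch below uses a toy 1-D k-NN probe on synthetic Gaussian "activations"; the probe, the data, and the candidate k values are illustrative assumptions, not the paper's actual probes or activations.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: probability that a positive example outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_scores(train, xs, k):
    """Score each x by the fraction of positives among its k nearest training points."""
    return [sum(y for _, y in sorted(train, key=lambda p: abs(p[0] - x))[:k]) / k
            for x in xs]

def sample(n):
    # Synthetic 1-D "activations": positives centred at 1, negatives at 0 (assumption).
    data = []
    for _ in range(n):
        y = int(random.random() < 0.5)
        data.append((random.gauss(float(y), 1.0), y))
    return data

random.seed(0)
data, test = sample(640), sample(200)

# n > 128, so use the 80%/20% training/validation split from Table 5.
split = int(0.8 * len(data))
train, val = data[:split], data[split:]
val_x, val_y = [x for x, _ in val], [y for _, y in val]

# Select the hyperparameter (here, k) with maximal validation AUC ...
val_aucs = {k: auc(knn_scores(train, val_x, k), val_y) for k in (1, 5, 15, 31)}
best_k = max(val_aucs, key=val_aucs.get)

# ... then evaluate the chosen probe on the held-out test set (AUC^test_p).
test_auc = auc(knn_scores(train, [x for x, _ in test], best_k),
               [y for _, y in test])
print(f"best k = {best_k}, test AUC = {test_auc:.3f}")
```

Note that the test set is touched exactly once, after hyperparameter selection, which is what makes AUC^test_p an unbiased estimate of probe quality.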
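Two of the search grids quoted from Appendix D.3 under Experiment Setup can be written out concretely. The dict names and layout below are this sketch's own convention, not the paper's released code (none is available); only the ranges come from the quoted text.

```python
# Hyperparameter grids reconstructed from the Appendix D.3 quotes (assumed layout).
xgb_grid = {
    # "n_estimators: Ranges from 50 to 250 in steps of 50"
    "n_estimators": list(range(50, 251, 50)),
}
mlp_grid = {
    # "Network depth: 1 to 3 hidden layers"
    "hidden_layers": [1, 2, 3],
    # "learning_rate_init: Five values ranging logarithmically from 10^-4 to 10^-2"
    "learning_rate_init": [10 ** (-4 + 2 * i / 4) for i in range(5)],
}

print(xgb_grid["n_estimators"])
print([round(lr, 6) for lr in mlp_grid["learning_rate_init"]])
```

Logarithmic spacing means the exponent, not the value, steps uniformly: here from -4 to -2 in increments of 0.5.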