Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Authors: Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to ensemble SAEs and baselines to consistently improve over just baseline methods. |
| Researcher Affiliation | Academia | Massachusetts Institute of Technology. Correspondence to: Subhash Kantamneni <EMAIL>, Joshua Engels <EMAIL>. |
| Pseudocode | No | The paper describes methods through narrative text and mathematical equations (e.g., Equation 1), but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement that the code for the methodology described in this paper is released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We collect a diverse set of 113 binary classification datasets listed in Table 4 (Appendix C). Table 4 explicitly lists dataset names and their corresponding citations, many of which refer to publicly available sources or well-known benchmarks, such as 'Gurnee & Tegmark (2024)' and 'AI, T. and Ishii, D. Spam Text Message Classification kaggle.com. https://www.kaggle.com/datasets/team-ai/spam-text-message-classification.' |
| Dataset Splits | Yes | Often, a probe p has hyperparameters h_p we would like to optimize. We select the h_p that has the maximal validation AUC using the cross-validation strategy described in Table 5. We then test p with the optimal h_p on a held-out test set to calculate AUC^test_p. All datasets have at least 100 testing examples, with most having more (the average test set size is 1945). Table 5 provides specific selection methods for hyperparameter tuning based on data size, including 'Use 80%/20% training/validation split' for n > 128. |
| Hardware Specification | No | The paper mentions using specific language models like Gemma-2-9B and Llama-3.1-8B and discusses training Sparse Autoencoders (SAEs), but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions various models and tools used (e.g., Gemma Scope, Claude-3.5-Sonnet, GPT-4o), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries) that would be needed to replicate the experimental setup. |
| Experiment Setup | Yes | We use 5 baseline probing methods, detailed with their respective hyperparameters in Table 2. Appendix D.3 'Probing Method Hyperparameter Details' provides specific ranges and values for hyperparameters for Logistic Regression, PCA Regression, K-Nearest Neighbors (KNN), XGBoost (e.g., 'n_estimators: Ranges from 50 to 250 in steps of 50'), and Multilayer Perceptron (MLP) (e.g., 'Network depth: 1 to 3 hidden layers', 'learning_rate_init: Five values ranging logarithmically from 10^-4 to 10^-2'). |
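The Dataset Splits and Experiment Setup rows together describe a concrete probing protocol: hold out a test set, select a probe's hyperparameter by validation AUC on an 80%/20% training/validation split (the n > 128 case from Table 5), then report AUC on the held-out test set. The sketch below illustrates that protocol with a logistic-regression probe on synthetic stand-in activations; the data, regularization grid, and training details are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def auc(y_true, scores):
    # AUC via the rank-sum formulation: the probability that a random
    # positive example is scored above a random negative example.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logreg(X, y, l2, lr=0.1, steps=500):
    # Plain gradient-descent logistic regression with an L2 penalty;
    # a minimal stand-in for the paper's logistic-regression probe.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w

rng = np.random.default_rng(0)

# Toy stand-in for LLM activations: 600 examples, 32-dim features,
# with a noisy linear ground-truth label.
X = rng.normal(size=(600, 32))
w_true = rng.normal(size=32)
y = (X @ w_true + rng.normal(scale=2.0, size=600) > 0).astype(float)

# Hold out a test set first (the paper uses >= 100 test examples per dataset).
idx = rng.permutation(600)
test_idx, trainval_idx = idx[:150], idx[150:]
# 80%/20% training/validation split for hyperparameter selection.
val_idx, train_idx = trainval_idx[:90], trainval_idx[90:]

best_auc, best_l2 = -1.0, None
for l2 in [1e-3, 1e-2, 1e-1, 1.0]:  # illustrative regularization grid
    w = fit_logreg(X[train_idx], y[train_idx], l2)
    val_auc = auc(y[val_idx], X[val_idx] @ w)
    if val_auc > best_auc:
        best_auc, best_l2 = val_auc, l2

# Refit on train+val with the selected hyperparameter, then report
# AUC^test_p on the held-out test set.
w_final = fit_logreg(X[trainval_idx], y[trainval_idx], best_l2)
test_auc = auc(y[test_idx], X[test_idx] @ w_final)
print(f"selected l2={best_l2}, test AUC={test_auc:.3f}")
```

Holding out the test set before any hyperparameter selection is the point of the protocol: the reported AUC is never used to pick h_p.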