Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Authors: Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to ensemble SAEs and baselines to consistently improve over just baseline methods.
Researcher Affiliation | Academia | Massachusetts Institute of Technology. Correspondence to: Subhash Kantamneni <EMAIL>, Joshua Engels <EMAIL>.
Pseudocode | No | The paper describes methods through narrative text and mathematical equations (e.g., Equation 1), but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain an explicit statement that the code for the methodology described in this paper is released, nor does it provide a link to a code repository.
Open Datasets | Yes | We collect a diverse set of 113 binary classification datasets listed in Table 4 (Appendix C). Table 4 explicitly lists dataset names and their corresponding citations, many of which refer to publicly available sources or well-known benchmarks, such as 'Gurnee & Tegmark (2024)' and 'AI, T. and Ishii, D. Spam Text Message Classification kaggle.com. https://www.kaggle.com/datasets/team-ai/spam-text-message-classification.'
Dataset Splits | Yes | Often, a probe p has hyperparameters hp we would like to optimize. We select hp that has the maximal validation AUC using the cross-validation strategy described in Table 5. We then test p with optimal hp on a held-out test set to calculate AUC^test_p. All datasets have at least 100 testing examples, with most having more (the average test set size is 1945). Table 5 provides specific selection methods for hyperparameter tuning based on data size, including 'Use 80%/20% training/validation split' for n > 128.
Hardware Specification | No | The paper mentions using specific language models like Gemma-2-9B and Llama-3.1-8B and discusses training Sparse Autoencoders (SAEs), but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions various models and tools used (e.g., Gemma Scope, Claude-3.5-Sonnet, GPT-4o), but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or other libraries) that would be needed to replicate the experimental setup.
Experiment Setup | Yes | We use 5 baseline probing methods, detailed with their respective hyperparameters in Table 2. Appendix D.3 'Probing Method Hyperparameter Details' provides specific ranges and values for hyperparameters for Logistic Regression, PCA Regression, K-Nearest Neighbors (KNN), XGBoost (e.g., 'n_estimators: Ranges from 50 to 250 in steps of 50'), and Multilayer Perceptron (MLP) (e.g., 'Network depth: 1 to 3 hidden layers', 'learning_rate_init: Five values ranging logarithmically from 10^-4 to 10^-2').
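The split protocol quoted under Dataset Splits can be sketched end to end: for n > 128, Table 5 prescribes an 80%/20% training/validation split, the hyperparameter with maximal validation AUC is selected, and AUC^test_p is then computed on a held-out test set. The sketch below uses a toy 1-D k-NN probe on synthetic Gaussian "activations"; the probe, the data, and the candidate k values are illustrative assumptions, not the paper's actual probes or activations.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: probability that a positive example outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_scores(train, xs, k):
    """Score each x by the fraction of positives among its k nearest training points."""
    return [sum(y for _, y in sorted(train, key=lambda p: abs(p[0] - x))[:k]) / k
            for x in xs]

def sample(n):
    # Synthetic 1-D "activations": positives centred at 1, negatives at 0 (assumption).
    data = []
    for _ in range(n):
        y = int(random.random() < 0.5)
        data.append((random.gauss(float(y), 1.0), y))
    return data

random.seed(0)
data, test = sample(640), sample(200)

# n > 128, so use the 80%/20% training/validation split from Table 5.
split = int(0.8 * len(data))
train, val = data[:split], data[split:]
val_x, val_y = [x for x, _ in val], [y for _, y in val]

# Select the hyperparameter (here, k) with maximal validation AUC ...
val_aucs = {k: auc(knn_scores(train, val_x, k), val_y) for k in (1, 5, 15, 31)}
best_k = max(val_aucs, key=val_aucs.get)

# ... then evaluate the chosen probe on the held-out test set (AUC^test_p).
test_auc = auc(knn_scores(train, [x for x, _ in test], best_k),
               [y for _, y in test])
print(f"best k = {best_k}, test AUC = {test_auc:.3f}")
```

Note that the test set is touched exactly once, after hyperparameter selection, which is what makes AUC^test_p an unbiased estimate of probe quality.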
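Two of the search grids quoted from Appendix D.3 under Experiment Setup can be written out concretely. The dict names and layout below are this sketch's own convention, not the paper's released code (none is available); only the ranges come from the quoted text.

```python
# Hyperparameter grids reconstructed from the Appendix D.3 quotes (assumed layout).
xgb_grid = {
    # "n_estimators: Ranges from 50 to 250 in steps of 50"
    "n_estimators": list(range(50, 251, 50)),
}
mlp_grid = {
    # "Network depth: 1 to 3 hidden layers"
    "hidden_layers": [1, 2, 3],
    # "learning_rate_init: Five values ranging logarithmically from 10^-4 to 10^-2"
    "learning_rate_init": [10 ** (-4 + 2 * i / 4) for i in range(5)],
}

print(xgb_grid["n_estimators"])
print([round(lr, 6) for lr in mlp_grid["learning_rate_init"]])
```

Logarithmic spacing means the exponent, not the value, steps uniformly: here from -4 to -2 in increments of 0.5.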