Decomposing The Dark Matter of Sparse Autoencoders

Authors: Joshua Engels, Logan Riggs Smith, Max Tegmark

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments on Gemma 2 2B and 9B (Team et al., 2024) and Llama 3.1 8B (AI@Meta, 2024).
Researcher Affiliation | Academia | Joshua Engels EMAIL, MIT; Logan Smith EMAIL, Independent; Max Tegmark EMAIL, MIT & IAIFI
Pseudocode | No | The paper describes methods and processes using mathematical equations and textual descriptions, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured, code-like steps.
Open Source Code | Yes | Code at https://anonymous.4open.science/r/SAE-Dark-Matter-1163
Open Datasets | Yes | We use 300 contexts of 1024 tokens from the uncopyrighted subset of the Pile (Gao et al., 2020)
Dataset Splits | Yes | For linear regressions, we use a random subset of size 150k as training examples (since all models have a dimension of less than 5000, this prevents overfitting) and report the R² on the other 97k activations.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts, or detailed machine specifications) used for running its experiments. It mentions the language models used (Gemma, Llama) but not the machines they ran on.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names such as Python 3.8 or PyTorch 1.9) needed to replicate the experiment.
Experiment Setup | Yes | For linear regressions, we use a random subset of size 150k as training examples (since all models have a dimension of less than 5000, this prevents overfitting) and report the R² on the other 97k activations. For linear transformations to a multi-dimensional output, we report the average R² across dimensions. We include bias terms in our linear regressions but omit them from equations for simplicity. We train SAEs to convergence (about 100M tokens) on each of these components of error and find that the SAE trained on Nonlinear Error(x) converges to a fraction of variance unexplained an absolute 5 percent higher than the SAE trained on the linear component of SAE error (0.59 and 0.54, respectively).
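The dataset-split and experiment-setup rows above quote a simple protocol: fit a linear regression with a bias term on a random 150k subset of activations, then report the average R² across output dimensions on the held-out 97k. A minimal sketch of that protocol is below, using small synthetic arrays in place of the paper's actual model activations (the dimensions, noise level, and split sizes here are illustrative assumptions, not the paper's values):

```python
import numpy as np

def fit_linear_map(X_train, Y_train):
    # Least-squares linear map with a bias term (the paper includes bias
    # terms in its regressions but omits them from the equations).
    Xb = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y_train, rcond=None)
    return W

def r2_per_dim(X_test, Y_test, W):
    # R² computed independently for each output dimension on held-out data.
    Xb = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    pred = Xb @ W
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Hypothetical stand-in data: 1000 "activations" split 600 train / 400 test,
# mirroring the paper's random 150k / 97k split at toy scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
true_W = rng.normal(size=(8, 4))
Y = X @ true_W + 0.1 * rng.normal(size=(1000, 4))

idx = rng.permutation(1000)
train, test = idx[:600], idx[600:]
W = fit_linear_map(X[train], Y[train])
# "Average R² across dimensions" for a multi-dimensional output.
avg_r2 = r2_per_dim(X[test], Y[test], W).mean()
```

Reporting R² on a held-out split (rather than on the training subset) is what lets the paper argue the linear fit is not an artifact of overfitting, given that the training set is far larger than the model dimension.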