Decomposing The Dark Matter of Sparse Autoencoders

Authors: Joshua Engels, Logan Riggs Smith, Max Tegmark

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments on Gemma 2 2B and 9B (Team et al., 2024) and Llama 3.1 8B (AI@Meta, 2024).
Researcher Affiliation | Academia | Joshua Engels EMAIL, MIT; Logan Smith EMAIL, Independent; Max Tegmark EMAIL, MIT & IAIFI
Pseudocode | No | The paper describes methods and processes using mathematical equations and textual descriptions, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured, code-like steps.
Open Source Code | Yes | Code at https://anonymous.4open.science/r/SAE-Dark-Matter-1163
Open Datasets | Yes | We use 300 contexts of 1024 tokens from the uncopyrighted subset of the Pile (Gao et al., 2020)
Dataset Splits | Yes | For linear regressions, we use a random subset of size 150k as training examples (since all models have a dimension of less than 5000, this prevents overfitting) and report the R² on the other 97k activations.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts, or detailed machine specifications) used for running its experiments. It mentions the language models used (Gemma, Llama) but not the machines they ran on.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names such as Python 3.8 or PyTorch 1.9) needed to replicate the experiment.
Experiment Setup | Yes | For linear regressions, we use a random subset of size 150k as training examples (since all models have a dimension of less than 5000, this prevents overfitting) and report the R² on the other 97k activations. For linear transformations to a multi-dimensional output, we report the average R² across dimensions. We include bias terms in our linear regressions but omit them from equations for simplicity. We train SAEs to convergence (about 100M tokens) on each of these components of error and find that the SAE trained on Nonlinear Error(x) converges to a fraction of variance unexplained an absolute 5 percent higher than the SAE trained on the linear component of SAE error (0.59 and 0.54, respectively).
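The dataset-split and experiment-setup rows above quote a simple protocol: fit a linear regression with a bias term on a random 150k subset of activations, then report the average R² across output dimensions on the held-out 97k. A minimal sketch of that protocol is below, using small synthetic arrays in place of the paper's actual model activations (the dimensions, noise level, and split sizes here are illustrative assumptions, not the paper's values):

```python
import numpy as np

def fit_linear_map(X_train, Y_train):
    # Least-squares linear map with a bias term (the paper includes bias
    # terms in its regressions but omits them from the equations).
    Xb = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y_train, rcond=None)
    return W

def r2_per_dim(X_test, Y_test, W):
    # R² computed independently for each output dimension on held-out data.
    Xb = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    pred = Xb @ W
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Hypothetical stand-in data: 1000 "activations" split 600 train / 400 test,
# mirroring the paper's random 150k / 97k split at toy scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
true_W = rng.normal(size=(8, 4))
Y = X @ true_W + 0.1 * rng.normal(size=(1000, 4))

idx = rng.permutation(1000)
train, test = idx[:600], idx[600:]
W = fit_linear_map(X[train], Y[train])
# "Average R² across dimensions" for a multi-dimensional output.
avg_r2 = r2_per_dim(X[test], Y[test], W).mean()
```

Reporting R² on a held-out split (rather than on the training subset) is what lets the paper argue the linear fit is not an artifact of overfitting, given that the training set is far larger than the model dimension.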