Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Authors: Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we find that Jacobian SAEs successfully induce sparsity in the Jacobian matrices between input and output SAE latents relative to standard SAEs without a Jacobian term (Section 5.1). We find that JSAEs achieve the desired increase in the sparsity of the Jacobian with only a slight decrease in reconstruction quality and model performance preservation, which remain roughly on par with standard SAEs. We also find that the input and output latents learned by Jacobian SAEs are approximately as interpretable as those of standard SAEs, as quantified by auto-interpretability scores. Importantly, we also find that the "computational units" discovered by JSAEs are often highly interpretable; for example, JSAEs find an output latent corresponding to whether the text is in German, which is computed using several input latents corresponding to tokens frequently found in German text (Section 5.2). |
| Researcher Affiliation | Academia | 1School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK. Correspondence to: Lucy Farnik <EMAIL>. |
| Pseudocode | No | The paper includes derivations and mathematical formulas (e.g., in Appendix A, "A. Efficiently computing the Jacobian") but does not present a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our source code can be found at https://github.com/lucyfarnik/jacobian-saes. |
| Open Datasets | Yes | Our experiments were performed on LLMs from the Pythia suite (Biderman et al., 2023); the figures in the main text contain results from Pythia-410m unless otherwise specified. We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch. |
| Dataset Splits | Yes | We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch. ... We collected statistics over 10 million tokens from the validation subset of the C4 text dataset. |
| Hardware Specification | Yes | The average training durations were 72 minutes for a pair of JSAEs and 33 minutes for a traditional SAE, with standard deviations below 30 seconds for both. We measured this by training ten of each model on Pythia-70m with an expansion factor of 32 for 100 million tokens on an RTX 3090. |
| Software Dependencies | No | Our training implementation is based on the open-source SAELens library (Bloom et al., 2024). We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters... The paper mentions software and libraries like SAELens and the Adam optimizer, but it does not specify concrete version numbers for any software component. |
| Experiment Setup | Yes | We trained on 300 million tokens with k = 32 and an expansion factor of 64 for Pythia-410m and 32 for smaller models. We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters and a constant learning-rate schedule with 1% warm-up steps, 20% decay steps, and a maximum value of 5 × 10−4. Additionally, we use 5% warm-up steps for the coefficient of the Jacobian term in the training loss. ... Except where noted, we use a batch size of 4096 sequences, each with a context size of 2048. |
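The setup quoted above (paired TopK SAEs around an MLP, plus a sparsity penalty on the Jacobian between input and output latents) can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction for orientation only, not the authors' implementation (which lives in the linked repository and builds on SAELens); the class and variable names, the toy dimensions, and the coefficient `lam` are all illustrative.

```python
import torch

class TopKSAE(torch.nn.Module):
    """Minimal TopK sparse autoencoder (illustrative sketch)."""
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.enc = torch.nn.Linear(d_model, d_sae)
        self.dec = torch.nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.enc(x)
        # Keep only the k largest pre-activations per example (TopK sparsity).
        _, idx = pre.topk(self.k, dim=-1)
        mask = torch.zeros_like(pre).scatter(-1, idx, 1.0)
        return pre * mask

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.dec(z), z

torch.manual_seed(0)
d_model, d_sae, k = 8, 32, 4  # toy sizes; the paper uses k = 32 and much larger widths
mlp = torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                          torch.nn.GELU(),
                          torch.nn.Linear(4 * d_model, d_model))
sae_in, sae_out = TopKSAE(d_model, d_sae, k), TopKSAE(d_model, d_sae, k)

x = torch.randn(16, d_model)   # MLP inputs
y = mlp(x)                     # MLP outputs
x_hat, z_in = sae_in(x)
y_hat, z_out = sae_out(y)

def out_latents(z: torch.Tensor) -> torch.Tensor:
    """Map input latents through the decoder, the MLP, and the output encoder."""
    return sae_out.encode(mlp(sae_in.dec(z)))

# Jacobian between input and output latents for one example (d_sae x d_sae);
# the training loss penalizes its L1 norm to sparsify the computation.
J = torch.autograd.functional.jacobian(out_latents, z_in[0])

lam = 1.0  # Jacobian-term coefficient (illustrative; the paper warms it up over 5% of steps)
loss = ((x - x_hat) ** 2).mean() + ((y - y_hat) ** 2).mean() + lam * J.abs().mean()
```

With TopK encoders, only k rows and k columns of the Jacobian can be nonzero for a given example, which is what makes the penalty tractable; the paper's Appendix A derives an efficient computation along these lines.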