Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Authors: Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we find that Jacobian SAEs successfully induce sparsity in the Jacobian matrices between input and output SAE latents relative to standard SAEs without a Jacobian term (Section 5.1). We find that JSAEs achieve the desired increase in the sparsity of the Jacobian with only a slight decrease in reconstruction quality and model performance preservation, which remain roughly on par with standard SAEs. We also find that the input and output latents learned by Jacobian SAEs are approximately as interpretable as standard SAEs, as quantified by auto-interpretability scores. Importantly, we also find that the "computational units" discovered by JSAEs are often highly interpretable. For example, JSAEs find an output latent corresponding to whether the text is in German, which is computed using several input latents corresponding to tokens frequently found in German text (Section 5.2).
Researcher Affiliation | Academia | School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK. Correspondence to: Lucy Farnik <EMAIL>.
Pseudocode | No | The paper includes derivations and mathematical formulas (e.g., Appendix A, "Efficiently computing the Jacobian") but does not present a structured pseudocode or algorithm block.
Open Source Code | Yes | Our source code can be found at https://github.com/lucyfarnik/jacobian-saes.
Open Datasets | Yes | Our experiments were performed on LLMs from the Pythia suite (Biderman et al., 2023); the figures in the main text contain results from Pythia-410m unless otherwise specified. We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch.
Dataset Splits | Yes | We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch. ... We collected statistics over 10 million tokens from the validation subset of the C4 text dataset.
Hardware Specification | Yes | The average training durations were 72 mins for a pair of JSAEs and 33 mins for a traditional SAE, with standard deviations below 30 seconds for both. We measured this by training ten of each model on Pythia-70m with an expansion factor of 32 for 100 million tokens on an RTX 3090.
Software Dependencies | No | Our training implementation is based on the open-source SAELens library (Bloom et al., 2024). We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters... The paper mentions software and libraries like SAELens and the Adam optimizer, but it does not specify concrete version numbers for any software component.
Experiment Setup | Yes | We trained on 300 million tokens with k = 32 and an expansion factor of 64 for Pythia-410m and 32 for smaller models. We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters and a constant learning-rate schedule with 1% warm-up steps, 20% decay steps, and a maximum value of 5 × 10⁻⁴. Additionally, we use 5% warm-up steps for the coefficient of the Jacobian term in the training loss. ... Except where noted, we use a batch size of 4096 sequences, each with a context size of 2048.
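The central quantity the review mentions, the Jacobian between input and output SAE latents, can be illustrated with a toy sketch. Everything below is a hypothetical NumPy illustration, not the paper's implementation: random weights stand in for a trained MLP and a pair of TopK SAEs, and restricting the Jacobian to the k active latents on each side (which keeps it a small k × k matrix) is why an L1 sparsity penalty on it is cheap to compute.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, d_sae, k = 8, 16, 32, 4

# Toy stand-ins: a pair of SAEs (decoder on the MLP input side, encoder on
# the output side) wrapped around a one-hidden-layer ReLU MLP.
W_dec_in  = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_sae)
W1        = rng.standard_normal((d_mlp, d_model))
W2        = rng.standard_normal((d_model, d_mlp))
W_enc_out = rng.standard_normal((d_sae, d_model))

# Sparse input latents: only k of d_sae entries are active (TopK-style).
a_in = np.zeros(d_sae)
active_in = rng.choice(d_sae, size=k, replace=False)
a_in[active_in] = rng.standard_normal(k)

# Forward pass: decode -> MLP -> encode.
x = W_dec_in @ a_in
h_pre = W1 @ x
h = np.maximum(h_pre, 0.0)                 # ReLU
y = W2 @ h
z = W_enc_out @ y
active_out = np.argsort(z)[-k:]            # TopK selection of output latents

# Jacobian of active output latents w.r.t. active input latents, by the
# chain rule; diag(relu') is applied as a row-scaling of W1.
relu_grad = (h_pre > 0).astype(float)
J = W_enc_out[active_out] @ W2 @ (relu_grad[:, None] * W1) @ W_dec_in[:, active_in]

# A sparsity term of this form could be added to the SAE training loss.
jacobian_penalty = np.abs(J).sum()
print(J.shape, jacobian_penalty)
```

Because the gradient flows only through the k active latents on each side, the Jacobian here is k × k rather than d_sae × d_sae, which matches the intuition behind the paper's "Efficiently computing the Jacobian" appendix, though the actual derivation there may differ.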
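The quoted learning-rate schedule (constant, with 1% warm-up steps, 20% decay steps, and a peak of 5 × 10⁻⁴) can be sketched as a step-wise multiplier. The function name and the linear shape of the ramps are assumptions; the quote does not specify how the warm-up and decay are shaped.

```python
def lr_multiplier(step, total_steps, warmup_frac=0.01, decay_frac=0.20):
    """Hypothetical helper: constant schedule with linear warm-up over the
    first warmup_frac of steps and linear decay over the last decay_frac."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return step / max(warmup_steps, 1)        # linear warm-up (assumed)
    if step >= decay_start:
        remaining = total_steps - step
        return remaining / max(total_steps - decay_start, 1)  # linear decay (assumed)
    return 1.0                                    # constant plateau

max_lr = 5e-4  # peak value from the quoted setup
# learning rate at a given step: max_lr * lr_multiplier(step, total_steps)
```

Under this sketch, the multiplier ramps from 0 to 1 over the first 1% of training, holds at 1 through the plateau, and ramps back toward 0 over the final 20%.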