Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Authors: Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we find that Jacobian SAEs successfully induce sparsity in the Jacobian matrices between input and output SAE latents relative to standard SAEs without a Jacobian term (Section 5.1). We find that JSAEs achieve the desired increase in the sparsity of the Jacobian with only a slight decrease in reconstruction quality and model performance preservation, which remain roughly on par with standard SAEs. We also find that the input and output latents learned by Jacobian SAEs are approximately as interpretable as those of standard SAEs, as quantified by auto-interpretability scores. Importantly, we also find that the "computational units" discovered by JSAEs are often highly interpretable; for example, JSAEs find an output latent corresponding to whether the text is in German, which is computed using several input latents corresponding to tokens frequently found in German text (Section 5.2). |
| Researcher Affiliation | Academia | 1School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK. Correspondence to: Lucy Farnik <EMAIL>. |
| Pseudocode | No | The paper includes derivations and mathematical formulas (e.g., in Appendix A, "A. Efficiently computing the Jacobian") but does not present a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Our source code can be found at https://github.com/lucyfarnik/jacobian-saes. |
| Open Datasets | Yes | Our experiments were performed on LLMs from the Pythia suite (Biderman et al., 2023); the figures in the main text contain results from Pythia-410m unless otherwise specified. We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch. |
| Dataset Splits | Yes | We train each pair of SAEs on 300 million tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset, for a single epoch. ... We collected statistics over 10 million tokens from the validation subset of the C4 text dataset. |
| Hardware Specification | Yes | The average training durations were 72 minutes for a pair of JSAEs and 33 minutes for a traditional SAE, with standard deviations below 30 seconds for both. We measured this by training ten of each model on Pythia-70m with an expansion factor of 32 for 100 million tokens on an RTX 3090. |
| Software Dependencies | No | Our training implementation is based on the open-source SAELens library (Bloom et al., 2024). We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters... The paper mentions software and libraries like SAELens and the Adam optimizer, but it does not specify concrete version numbers for any software component. |
| Experiment Setup | Yes | We trained on 300 million tokens with k = 32 and an expansion factor of 64 for Pythia-410m and 32 for smaller models. We use the Adam optimizer (Kingma & Ba, 2017) with the default beta parameters and a constant learning-rate schedule with 1% warm-up steps, 20% decay steps, and a maximum value of 5 × 10−4. Additionally, we use 5% warm-up steps for the coefficient of the Jacobian term in the training loss. ... Except where noted, we use a batch size of 4096 sequences, each with a context size of 2048. |
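The setup quoted above (paired TopK SAEs around an MLP, plus a sparsity penalty on the Jacobian between input and output latents) can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction for orientation only, not the authors' implementation (which lives in the linked repository and builds on SAELens); the class and variable names, the toy dimensions, and the coefficient `lam` are all illustrative.

```python
import torch

class TopKSAE(torch.nn.Module):
    """Minimal TopK sparse autoencoder (illustrative sketch)."""
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.enc = torch.nn.Linear(d_model, d_sae)
        self.dec = torch.nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.enc(x)
        # Keep only the k largest pre-activations per example (TopK sparsity).
        _, idx = pre.topk(self.k, dim=-1)
        mask = torch.zeros_like(pre).scatter(-1, idx, 1.0)
        return pre * mask

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.dec(z), z

torch.manual_seed(0)
d_model, d_sae, k = 8, 32, 4  # toy sizes; the paper uses k = 32 and much larger widths
mlp = torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                          torch.nn.GELU(),
                          torch.nn.Linear(4 * d_model, d_model))
sae_in, sae_out = TopKSAE(d_model, d_sae, k), TopKSAE(d_model, d_sae, k)

x = torch.randn(16, d_model)   # MLP inputs
y = mlp(x)                     # MLP outputs
x_hat, z_in = sae_in(x)
y_hat, z_out = sae_out(y)

def out_latents(z: torch.Tensor) -> torch.Tensor:
    """Map input latents through the decoder, the MLP, and the output encoder."""
    return sae_out.encode(mlp(sae_in.dec(z)))

# Jacobian between input and output latents for one example (d_sae x d_sae);
# the training loss penalizes its L1 norm to sparsify the computation.
J = torch.autograd.functional.jacobian(out_latents, z_in[0])

lam = 1.0  # Jacobian-term coefficient (illustrative; the paper warms it up over 5% of steps)
loss = ((x - x_hat) ** 2).mean() + ((y - y_hat) ** 2).mean() + lam * J.abs().mean()
```

With TopK encoders, only k rows and k columns of the Jacobian can be nonzero for a given example, which is what makes the penalty tractable; the paper's Appendix A derives an efficient computation along these lines.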