Low-Rank Adapting Models for Sparse Autoencoders

Authors: Matthew Chen, Joshua Engels, Max Tegmark

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3× to 20× faster on Gemma-2-2B and 2× to 10× faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once without harming general language model capabilities.
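The row above refers to inserting SAEs into the model's forward pass, with TopK SAEs being the variant trained in the paper. The following is a minimal NumPy sketch of a TopK SAE forward pass, assuming toy dimensions and random weights; the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of a TopK SAE forward pass: encode the residual-stream
# activation, keep only the k largest latent activations, then decode.
# All names, sizes, and weights here are illustrative assumptions.
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    acts = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU latent activations
    kth = np.partition(acts, -k, axis=-1)[..., -k][..., None]
    acts = np.where(acts >= kth, acts, 0.0)     # zero all but the top k
    recon = acts @ W_dec + b_dec                # reconstruction fed back in
    return recon, acts

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 128, 8                  # toy sizes, not the paper's
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
recon, acts = topk_sae_forward(x, W_enc, np.zeros(d_sae),
                               W_dec, np.zeros(d_model), k)
```

The "cross entropy loss gap" in the row is the increase in language-model loss when `recon` replaces the original activation during the forward pass; the paper's k is 64.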
Researcher Affiliation | Academia | Massachusetts Institute of Technology, Cambridge, MA. Correspondence to: Matthew Chen <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical formulations and equations (e.g., Section 3.4 'Method for Low-Rank Adapting Models to SAEs'), but it does not contain any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code available at https://github.com/matchten/LoRA-Models-for-SAEs
Open Datasets | Yes | We train TopK SAEs with k = 64 for Gemma-2-2B and Llama-3.2-1B for 2B and 4B tokens, respectively, on the Red Pajama dataset (Weber et al., 2024). We train on 15M random tokens of The Pile (uncopyrighted) dataset (Gao et al., 2020), and evaluate on a held out validation set of 1M random tokens. For a given SAE latent, we steer on a dataset of 500 positive and negative samples. The negative dataset consists of an equal mix of Arabic tweets (Pain, 2024), medical facts (MedAlpaca, 2024), recipes (Corbt, 2024), Shakespearean quotes (Roudranil, 2024), and law texts (GPT-4o-mini generated).
Dataset Splits | Yes | We train on 15M random tokens of The Pile (uncopyrighted) dataset (Gao et al., 2020), and evaluate on a held out validation set of 1M random tokens. For a given SAE latent, we steer on a dataset of 500 positive and negative samples. ... After tuning, we evaluate the effect of α on a test set consisting of the remaining negative samples.
Hardware Specification | No | The paper discusses computational cost and compute limitations, but it does not provide specific details about the hardware used, such as GPU models, CPU types, or server specifications.
Software Dependencies | No | The paper refers to language models like Gemma-2-2B and Llama-3.2-1B, but it does not list any specific software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, CUDA versions) that would be needed for reproducibility.
Experiment Setup | Yes | Unless otherwise specified, we use a layer 12 residual stream SAE. We train on 15M random tokens of The Pile (uncopyrighted) dataset (Gao et al., 2020), and evaluate on a held out validation set of 1M random tokens. We train on layer 12 of Llama-3.2-1B and Gemma-2-2B. We train TopK and e2e SAEs for 4B tokens on Llama-3.2-1B and for 2B tokens on Gemma-2-2B (similar to the number of tokens trained on for Gemma Scope SAEs). On each TopK SAE training checkpoint of Llama-3.2-1B we do LoRA finetuning for 100M tokens, while we finetune for 15M tokens on Gemma-2-2B TopK SAE checkpoints. We use the learning rates suggested in Gao et al. (2024).
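The LoRA finetuning named in the setup row replaces full-weight training with a low-rank update. A minimal NumPy sketch of that update, assuming a standard LoRA parameterization W + (α/r)·BA with illustrative rank and scale (not the paper's hyperparameters):

```python
import numpy as np

# Hedged sketch of a LoRA update: keep the frozen weight W and train only a
# low-rank pair (A, B), applying W + (alpha / r) * B @ A in its place.
# Rank r and scale alpha below are illustrative, not the paper's settings.
def lora_apply(W, A, B, alpha):
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 32, 16, 4, 8.0
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in)) * 0.01    # small random init for A
B = np.zeros((d_out, r))                 # zero init for B: no change at start
W_init = lora_apply(W, A, B, alpha)      # identical to W before training
B_trained = rng.normal(size=(d_out, r))  # stand-in for a trained B
W_adapted = lora_apply(W, A, B_trained, alpha)
```

The zero initialization of B means the adapted model starts out exactly equal to the base model, and the trained update B @ A can never exceed rank r, which is what keeps the added parameter count small relative to full finetuning.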