Low-Rank Adapting Models for Sparse Autoencoders

Authors: Matthew Chen, Joshua Engels, Max Tegmark

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3× to 20× faster on Gemma-2-2B and 2× to 10× faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once without harming general language model capabilities.
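The row above refers to inserting SAEs into the model's forward pass, with TopK SAEs being the variant trained in the paper. The following is a minimal NumPy sketch of a TopK SAE forward pass, assuming toy dimensions and random weights; the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of a TopK SAE forward pass: encode the residual-stream
# activation, keep only the k largest latent activations, then decode.
# All names, sizes, and weights here are illustrative assumptions.
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    acts = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU latent activations
    kth = np.partition(acts, -k, axis=-1)[..., -k][..., None]
    acts = np.where(acts >= kth, acts, 0.0)     # zero all but the top k
    recon = acts @ W_dec + b_dec                # reconstruction fed back in
    return recon, acts

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 128, 8                  # toy sizes, not the paper's
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
recon, acts = topk_sae_forward(x, W_enc, np.zeros(d_sae),
                               W_dec, np.zeros(d_model), k)
```

The "cross entropy loss gap" in the row is the increase in language-model loss when `recon` replaces the original activation during the forward pass; the paper's k is 64.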
Researcher Affiliation | Academia | Massachusetts Institute of Technology, Cambridge, MA. Correspondence to: Matthew Chen <EMAIL>.
Pseudocode | No | The paper describes methods using mathematical formulations and equations (e.g., Section 3.4 'Method for Low-Rank Adapting Models to SAEs'), but it does not contain any clearly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code available at https://github.com/matchten/LoRA-Models-for-SAEs
Open Datasets | Yes | We train TopK SAEs with k = 64 for Gemma-2-2B and Llama-3.2-1B for 2B and 4B tokens, respectively, on the Red Pajama dataset (Weber et al., 2024). We train on 15M random tokens of The Pile (uncopyrighted) dataset (Gao et al., 2020), and evaluate on a held out validation set of 1M random tokens. For a given SAE latent, we steer on a dataset of 500 positive and negative samples. The negative dataset consists of an equal mix of Arabic tweets (Pain, 2024), medical facts (MedAlpaca, 2024), recipes (Corbt, 2024), Shakespearean quotes (Roudranil, 2024), and law texts (GPT-4o-mini generated).
Dataset Splits | Yes | We train on 15M random tokens of The Pile (uncopyrighted) dataset (Gao et al., 2020), and evaluate on a held out validation set of 1M random tokens. For a given SAE latent, we steer on a dataset of 500 positive and negative samples. ... After tuning, we evaluate the effect of α on a test set consisting of the remaining negative samples.
Hardware Specification | No | The paper discusses computational cost and compute limitations, but it does not provide specific details about the hardware used, such as GPU models, CPU types, or server specifications.
Software Dependencies | No | The paper refers to language models like Gemma-2-2B and Llama-3.2-1B, but it does not list any specific software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, CUDA versions) that would be needed for reproducibility.
Experiment Setup | Yes | Unless otherwise specified, we use a layer 12 residual stream SAE. We train on 15M random tokens of The Pile (uncopyrighted) dataset (Gao et al., 2020), and evaluate on a held out validation set of 1M random tokens. We train on layer 12 of Llama-3.2-1B and Gemma-2-2B. We train TopK and e2e SAEs for 4B tokens on Llama-3.2-1B and for 2B tokens on Gemma-2-2B (similar to the number of tokens trained on for Gemma Scope SAEs). On each TopK SAE training checkpoint of Llama-3.2-1B we do LoRA finetuning for 100M tokens, while we finetune for 15M tokens on Gemma-2-2B TopK SAE checkpoints. We use the learning rates suggested in Gao et al. (2024).
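The LoRA finetuning named in the setup row replaces full-weight training with a low-rank update. A minimal NumPy sketch of that update, assuming a standard LoRA parameterization W + (α/r)·BA with illustrative rank and scale (not the paper's hyperparameters):

```python
import numpy as np

# Hedged sketch of a LoRA update: keep the frozen weight W and train only a
# low-rank pair (A, B), applying W + (alpha / r) * B @ A in its place.
# Rank r and scale alpha below are illustrative, not the paper's settings.
def lora_apply(W, A, B, alpha):
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 32, 16, 4, 8.0
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in)) * 0.01    # small random init for A
B = np.zeros((d_out, r))                 # zero init for B: no change at start
W_init = lora_apply(W, A, B, alpha)      # identical to W before training
B_trained = rng.normal(size=(d_out, r))  # stand-in for a trained B
W_adapted = lora_apply(W, A, B_trained, alpha)
```

The zero initialization of B means the adapted model starts out exactly equal to the base model, and the trained update B @ A can never exceed rank r, which is what keeps the added parameter count small relative to full finetuning.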