Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Authors: Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As a case study, we apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with SAEs trained on either the IOI or OpenWebText datasets. We find that SAEs capture interpretable features for the IOI task, and that more recent SAE variants such as Gated SAEs and Top-K SAEs are competitive with supervised features in terms of disentanglement and control over the model. Our results suggest that more detailed and controlled SAE evaluations are possible and informative, and that SAEs may hold promise for disentangling model computations in realistic scenarios.
Researcher Affiliation | Industry | The authors list personal email addresses (EMAIL, EMAIL, EMAIL). The domains end in '.com', and no institutional affiliation (university or company) is stated explicitly. Following the prompt's instruction to classify '.com' domains under industry, the affiliation is classified as industry; note, however, that personal '.com' emails do not by themselves establish a corporate affiliation.
Pseudocode | No | The paper describes methods and procedures using prose and mathematical equations throughout, but it does not contain any structured pseudocode blocks or explicitly labeled algorithm sections.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository for the methodology described in the paper.
Open Datasets | Yes | The paper explicitly cites and uses 'OpenWebText (Gokaslan & Cohen, 2019)', a known public dataset. The full reference is given: 'Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.'
Dataset Splits | Yes | For the IOI task dataset, the paper states: 'We use GPT2-Small for the IOI task, with a dataset that spans 216 single-token names, 144 single-token objects and 75 single-token places, which are split 1:1 across a training and test set.' Additionally, for task SAEs, it mentions: 'We use a training set of 20,000 examples and an evaluation set of 8,000 examples'.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as particular GPU models, CPU specifications, or memory amounts.
Software Dependencies | No | The paper discusses various models and methodologies (e.g., GPT-2 Small) but does not list the specific ancillary software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x) used to implement or conduct the experiments.
Experiment Setup | Yes | The paper provides extensive details on the experimental setup and hyperparameters, particularly in Appendix 7.13, 'DETAILS FOR TRAINING SPARSE AUTOENCODERS'. For instance, it mentions: 'For all SAE variants considered, we used the same (small) learning rate of 3e-4, trained for 2000 epochs in total, and applied resampling followed by a learning rate warmup over 100 epochs... We sweep over values λ (0.5, 1.0, 2.5, 5.0) for the ℓ1 regularization penalty for vanilla and gated SAEs, and over values k (3, 6, 12, 24) for Top-K SAEs.'
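The sweep quoted above can be sketched as a simple configuration grid. This is a hypothetical sketch based only on the hyperparameter values the paper reports; the paper's training code is not released, and the config structure (field names, variant labels) is assumed for illustration:

```python
from itertools import product

# Values reported in Appendix 7.13 of the paper; everything else is assumed.
LR = 3e-4
EPOCHS = 2000
WARMUP_EPOCHS = 100

l1_penalties = [0.5, 1.0, 2.5, 5.0]  # λ sweep for vanilla and gated SAEs
top_k_values = [3, 6, 12, 24]        # k sweep for Top-K SAEs

# Cross variants with their respective swept hyperparameter.
configs = (
    [{"variant": v, "l1": l1, "lr": LR, "epochs": EPOCHS, "warmup": WARMUP_EPOCHS}
     for v, l1 in product(["vanilla", "gated"], l1_penalties)]
    + [{"variant": "topk", "k": k, "lr": LR, "epochs": EPOCHS, "warmup": WARMUP_EPOCHS}
       for k in top_k_values]
)

# 2 variants x 4 penalties + 4 k values = 12 training runs in the sweep.
print(len(configs))  # → 12
```

Enumerating the grid this way makes the reported sweep size explicit: the quoted settings imply 12 distinct SAE training runs, all sharing the same learning rate and epoch budget.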