Residual Stream Analysis with Multi-Layer SAEs

Authors: Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that multi-layer SAEs achieve comparable reconstruction error and downstream loss to single-layer SAEs while allowing us to directly identify and analyze features that are active at multiple layers (Section 4.1). When aggregating over a large sample of tokens, we find that individual latents are likely to be active at multiple layers, and this measure increases with the number of latents. However, for a single token, latent activations are more likely to be isolated to a single layer. For larger underlying transformers, we show that the residual stream activation vectors at adjacent layers are more similar and that the degree to which latents are active at multiple layers increases."
Researcher Affiliation | Academia | Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison; School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
Pseudocode | No | The paper provides mathematical equations describing the encoder, decoder, loss functions, and tuned-lens transformations. However, it does not include any clearly labeled "Pseudocode" or "Algorithm" blocks with structured steps for a method or procedure.
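Although the paper gives its method only as equations, the Top-K encoder/decoder it describes is simple enough to sketch. The following is a minimal illustrative forward pass in pure Python; all names (`top_k`, `encode`, `decode`) and shapes are our own assumptions, not taken from the paper or the released repository.

```python
# Illustrative sketch of a Top-K SAE forward pass, assuming the standard
# formulation: z = TopK(W_enc x + b_enc), x_hat = W_dec z + b_dec.

def top_k(values, k):
    # Top-K activation: keep the k largest pre-activations, zero the rest.
    # (Ties at the threshold may keep more than k values in this simple version.)
    if k >= len(values):
        return list(values)
    threshold = sorted(values, reverse=True)[k - 1]
    return [v if v >= threshold else 0.0 for v in values]

def encode(x, w_enc, b_enc, k):
    # n latents from a d-dimensional residual stream vector x;
    # w_enc is n rows of length d, b_enc has length n.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return top_k(pre, k)

def decode(z, w_dec, b_dec):
    # Reconstruction: w_dec is d rows of length n, b_dec has length d.
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(w_dec, b_dec)]
```

For example, with d = 2, n = 4, and k = 1, `encode([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]], [0.0, 0.0, 0.0, 0.0], 1)` returns `[0.0, 0.0, 2.0, 0.0]`: only the single largest latent survives the Top-K activation.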
Open Source Code | Yes | "We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae."
Open Datasets | Yes | "We train MLSAEs primarily on GPT-style language models from the Pythia suite (Biderman et al., 2023)... We train each autoencoder on 1 billion tokens from the Pile (Gao et al., 2020), excluding the copyrighted Books3 dataset"
Dataset Splits | No | "We train each autoencoder on 1 billion tokens from the Pile (Gao et al., 2020)... for a single epoch... We use an effective batch size of 131072 tokens (64 sequences) for all experiments... We report the values of these metrics over one million tokens from the test set."
Hardware Specification | Yes | "We trained most MLSAEs on a single NVIDIA GeForce RTX 3090 GPU for between 12 and 24 hours; we trained the largest MLSAEs (e.g., with Pythia-1b or an expansion factor of R = 256) on a single NVIDIA A100 80GB GPU for up to three days."
Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2017) and refers to existing implementations (Gao et al., 2023; Belrose, 2024). However, it does not specify version numbers for any software libraries, programming languages, or other dependencies necessary to replicate the experiment environment.
Experiment Setup | Yes | "Our hyperparameters are the expansion factor R = n/d, the ratio of the number of latents to the model dimension, and the sparsity k, the number of largest latents to keep in the Top-K activation function. We choose expansion factors as powers of 2 between 1 and 256... and k as powers of 2 between 16 and 512... Following Gao et al. (2024), we choose k_aux as a power of 2 close to d/2 and α = 1/32... We use the Adam optimizer (Kingma & Ba, 2017) with the default β parameters, a constant learning rate of 1 × 10^−4, and ε = 6.25 × 10^−10. We use an effective batch size of 131072 tokens (64 sequences) for all experiments."
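The quoted hyperparameter sweep can be enumerated programmatically. The sketch below assumes only the ranges stated above (R and k as powers of 2, k_aux near d/2, α = 1/32, the Adam settings, and the token batch size); the function name and dictionary keys are our own, not taken from the released code.

```python
# Illustrative enumeration of the hyperparameter grid described in the paper.
# Only the numeric ranges come from the text; the structure is an assumption.

def hyperparameter_grid(d_model):
    expansion_factors = [2 ** i for i in range(9)]   # R in {1, 2, ..., 256}
    sparsities = [2 ** i for i in range(4, 10)]      # k in {16, 32, ..., 512}
    # k_aux: the power of 2 closest to d/2, following Gao et al. (2024).
    k_aux = min((2 ** i for i in range(1, 16)), key=lambda p: abs(p - d_model / 2))
    optimizer = {"name": "Adam", "lr": 1e-4, "eps": 6.25e-10}
    return [
        {
            "R": r,
            "n_latents": r * d_model,                # n = R * d
            "k": k,
            "k_aux": k_aux,
            "alpha": 1 / 32,
            "batch_tokens": 131072,                  # 64 sequences per batch
            **optimizer,
        }
        for r in expansion_factors
        for k in sparsities
    ]
```

For instance, taking d_model = 512 (the Pythia-70m residual dimension, used here only for illustration) yields 9 × 6 = 54 configurations, each with k_aux = 256.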