Not All Language Model Features Are One-Dimensionally Linear

Authors: Josh Engels, Eric Michaud, Isaac Liao, Wes Gurnee, Max Tegmark

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g., circular features representing days of the week and months of the year. We identify tasks in which these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we examine the continuity of the days of the week feature in Mistral 7B."
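The "circular feature" idea described above can be illustrated with a small sketch: if the seven days of the week sit at equally spaced angles on a 2D circle, then adding a duration of d days corresponds to rotating by d · 2π/7, which realizes modular arithmetic geometrically. The function names and layout below are illustrative, not the authors' implementation.

```python
import numpy as np

def day_to_point(day_index, n=7):
    """Map day index 0..n-1 to a point on the unit circle."""
    angle = 2 * np.pi * day_index / n
    return np.array([np.cos(angle), np.sin(angle)])

def rotate(point, steps, n=7):
    """Advance a circular representation by `steps` positions via rotation."""
    theta = 2 * np.pi * steps / n
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ point

# "Monday (index 0) plus 9 days" lands on index 9 mod 7 = 2 (Wednesday):
assert np.allclose(rotate(day_to_point(0), 9), day_to_point(9 % 7))
```

Because rotation by a full turn is the identity, the representation implements "day of week" arithmetic mod 7 without any explicit modulo operation; the same construction with n = 12 covers months of the year.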
Researcher Affiliation | Academia | Joshua Engels (MIT), Eric J. Michaud (MIT & IAIFI), Isaac Liao (MIT), Wes Gurnee (MIT), Max Tegmark (MIT & IAIFI)
Pseudocode | Yes | "Pseudocode for this method is in the appendix in Alg. 1. This method succeeds on toy datasets of synthetic irreducible multi-dimensional features; see Appendix D.3."
Open Source Code | Yes | Code: https://github.com/JoshEngels/MultiDimensionalFeatures
Open Datasets | Yes | "We apply this method to language models using GPT-2 (Radford et al., 2019) SAEs trained by Bloom (2024) for every layer and Mistral 7B (Jiang et al., 2023) SAEs that we train on layers 8, 16, and 24 (training details in Appendix E). Our Mistral 7B sparse autoencoders (SAEs) are trained on over one billion tokens from a subset of the Pile (Gao et al., 2020) and Alpaca (Peng et al., 2023) datasets."
Dataset Splits | No | "For Weekdays, we range over the 7 days of the week and durations between 1 and 7 days to get 49 prompts. For Months, we range over the 12 months of the year and durations between 1 and 12 months to get 144 prompts. We run our patching on all 49 Weekday problems and 144 Month problems and use as clean runs the 6 or 11 other possible values for β, resulting in a total of 49 × 6 patching experiments for Weekdays and 144 × 11 patching experiments for Months."
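The prompt grids and patching-run counts quoted above can be reconstructed directly; the snippet below is an illustrative enumeration (not the authors' code), confirming the arithmetic of 49 × 6 = 294 and 144 × 11 = 1584 patching experiments.

```python
from itertools import product

# (start value, duration) grids as described: 7 days x durations 1..7,
# and 12 months x durations 1..12.
weekday_grid = list(product(range(7), range(1, 8)))
month_grid = list(product(range(12), range(1, 13)))

n_weekday_prompts = len(weekday_grid)   # 7 * 7 = 49
n_month_prompts = len(month_grid)       # 12 * 12 = 144

# Each prompt is patched using clean runs from the 6 (or 11) other
# possible values of beta:
n_weekday_patches = n_weekday_prompts * 6    # 294
n_month_patches = n_month_prompts * 11       # 1584
```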
Hardware Specification | Yes | "Intervention experiments were run on two V100 GPUs using less than 64 GB of CPU RAM; all experiments can be reproduced from our open-source repository in less than a day with this configuration. Mistral SAE training was run on a single V100 GPU."
Software Dependencies | No | "We use the TransformerLens library (Nanda & Bloom, 2022) for intervention experiments."
Experiment Setup | Yes | "Our Mistral 7B (Jiang et al., 2023) sparse autoencoders (SAEs) are trained on over one billion tokens from a subset of the Pile (Gao et al., 2020) and Alpaca (Peng et al., 2023) datasets. We use an expansion factor of 16, yielding a total of 65,536 dictionary elements for each SAE. To train our SAEs, we use an L_p sparsity penalty with p = 1/2 and sparsity coefficient λ = 0.012. Before an SAE forward pass, we normalize our activation vectors to have norm √d_model = 64 in the case of Mistral. We do not apply a pre-encoder bias. We use an AdamW optimizer with weight decay 10^-3 and learning rate 0.0002 with a linear warm-up. We apply dead-feature resampling (Bricken et al., 2023) five times over the course of training to converge on SAEs with around 1000 dead features. In our experiments, we set k = 5."
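The training objective described in this row can be sketched as a reconstruction loss plus an L_p sparsity penalty with p = 1/2. The sketch below uses NumPy with small illustrative dimensions (in the paper, d_model = 4096 for Mistral 7B, so √d_model = 64 and the dictionary has 4096 × 16 = 65,536 elements); weight initializations and variable names are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, expansion = 64, 16     # paper: d_model = 4096, expansion = 16
d_dict = d_model * expansion
lam, p = 0.012, 0.5             # sparsity coefficient and L_p exponent

W_enc = rng.standard_normal((d_model, d_dict)) / np.sqrt(d_model)
b_enc = np.zeros(d_dict)
W_dec = rng.standard_normal((d_dict, d_model)) / np.sqrt(d_dict)
b_dec = np.zeros(d_model)

def sae_loss(x):
    # Normalize each activation vector to norm sqrt(d_model) before the
    # forward pass, as described above.
    x = x / np.linalg.norm(x, axis=-1, keepdims=True) * np.sqrt(d_model)
    # Encoder: ReLU of a linear map; no pre-encoder bias is subtracted.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = f @ W_dec + b_dec                    # decoder reconstruction
    recon = ((x - x_hat) ** 2).sum(-1).mean()    # squared reconstruction error
    sparsity = lam * (f ** p).sum(-1).mean()     # L_p penalty, p = 1/2
    return recon + sparsity

loss = sae_loss(rng.standard_normal((8, d_model)))
```

In training, this loss would be minimized with AdamW (weight decay 10^-3, learning rate 2e-4, linear warm-up), with dead-feature resampling applied periodically as the row describes.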