Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Authors: Samuel Marks, Can Rager, Eric Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms in neural networks. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: we introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors. ... To evaluate our method, we discover sparse feature circuits (henceforth, feature circuits) on Pythia-70M and Gemma-2-2B for four variants of the subject-verb agreement task (Table 1). ... We find that SHIFT almost completely removes the classifier's dependence on gender information for both models. In the case of Gemma (but not Pythia), the feature ablations damage model performance; however, this performance is restored (without reintroducing the bias) by further training on the ambiguous set. Comparing SHIFT without retraining to the feature skyline, we further observe that SHIFT optimally or near-optimally identifies the best features to remove.
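The SHIFT edit described in this row, ablating SAE features a human judges task-irrelevant, can be sketched as zeroing the flagged feature activations before decoding back into the model's activation space. Everything below (shapes, random weights, feature indices) is a toy stand-in for illustration, not the paper's trained autoencoders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained sparse autoencoder (SAE) over LM activations.
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Nonnegative SAE feature activations for a batch of activations."""
    return np.maximum(x @ W_enc, 0.0)

def shift_ablate(x, bad_features):
    """SHIFT-style edit: zero the features a human judged task-irrelevant
    (e.g. gender features in the bios classifier), then decode."""
    f = encode(x)
    f[:, bad_features] = 0.0
    return f @ W_dec + b_dec

x = rng.normal(size=(4, d_model))
x_edited = shift_ablate(x, bad_features=[3, 17])
```

In the paper the flagged features come from inspecting the discovered circuit; here they are arbitrary indices.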
Researcher Affiliation | Collaboration | Samuel Marks (Northeastern University), Can Rager (Independent), Eric J. Michaud (MIT), Yonatan Belinkov (Technion - IIT), David Bau (Northeastern University), Aaron Mueller (Northeastern University)
Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. Figure 1 and Figure 2 provide high-level overviews and diagrams of the method but are not formatted as pseudocode.
Open Source Code | Yes | We release code, data, and autoencoders at github.com/saprmarks/feature-circuits.
Open Datasets | Yes | We adapt data from Finlayson et al. (2021) to produce datasets consisting of contrastive pairs of inputs that differ only in the grammatical number of the subject; the model's task is to choose the appropriate verb inflection. ... We illustrate SHIFT using the Bias in Bios dataset (BiB; De-Arteaga et al., 2019). ... Finally, we demonstrate our method's scalability by automatically discovering thousands of narrow LM behaviors (for example, predicting "to" as an infinitive object, or predicting commas in dates) with the clustering approach of Michaud et al. (2023), and then automatically discovering feature circuits for these behaviors (§5). ... The Pile (Gao et al., 2020)
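The contrastive-pair construction quoted above can be sketched as a pair of prompts differing only in the subject's grammatical number, each paired with the matching verb inflection. The template and field names below are hypothetical illustrations, not the released dataset's format:

```python
def make_pair(subject_sing, subject_plur, template="The {} near the cars"):
    """Build a hypothetical contrastive pair for subject-verb agreement:
    the two prompts differ only in the subject's number, and the model
    should prefer the verb inflection matching the subject."""
    return {
        "clean": template.format(subject_sing),   # singular-subject prompt
        "patch": template.format(subject_plur),   # plural-subject prompt
        "clean_answer": " is",                     # correct after the singular prompt
        "patch_answer": " are",                    # correct after the plural prompt
    }

pair = make_pair("doctor", "doctors")
```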
Dataset Splits | Yes | When evaluating feature circuits for faithfulness and completeness, we use a test split of our dataset, consisting of contrastive pairs not used to discover the circuit. ... We subsample BiB to produce two sets of labeled data: the ambiguous set, consisting of bios of male professors (labeled 0) and female nurses (labeled 1), and the balanced set, consisting of an equal number of bios for male professors, male nurses, female professors, and female nurses. These data carry profession labels (the intended signal) and gender labels (the unintended signal).
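The ambiguous/balanced split construction described in this row can be sketched as follows. The records are toy stand-ins for BiB bios (the real dataset has full biography text); only the gender/profession cell logic follows the quoted description:

```python
from itertools import product

# Hypothetical toy records standing in for BiB bios.
bios = [{"gender": g, "profession": p}
        for g, p in product(["male", "female"], ["professor", "nurse"])
        for _ in range(10)]

# Ambiguous set: male professors (label 0) and female nurses (label 1),
# so the intended signal (profession) and the unintended signal (gender)
# are perfectly correlated.
ambiguous = [(b, 0 if b["profession"] == "professor" else 1)
             for b in bios
             if (b["gender"], b["profession"]) in {("male", "professor"),
                                                   ("female", "nurse")}]

# Balanced set: equal counts from all four gender x profession cells,
# which decorrelates gender from the profession label.
def balanced_subset(bios, n_per_cell):
    out = []
    for g, p in product(["male", "female"], ["professor", "nurse"]):
        cell = [b for b in bios if (b["gender"], b["profession"]) == (g, p)]
        out.extend(cell[:n_per_cell])
    return out

balanced = balanced_subset(bios, 5)
```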
Hardware Specification | No | The paper mentions "a large (but one-time) upfront compute cost" but does not specify any particular hardware (e.g., GPU/CPU models or specific cloud instances) used for the experiments.
Software Dependencies | No | The paper mentions using "PyTorch's autograd algorithm" and the "scikit-learn (Pedregosa et al., 2011) spectral clustering implementation" but does not provide specific version numbers for these software components or any other key libraries.
Experiment Setup | Yes | We use λ = 0.1 and a learning rate of 10^-4. ... We train for 120,000 steps, resulting in a total of about 2 billion training tokens. ... reinitializing features which have not activated in the previous 12,500 steps ... Finally, we use a linear learning rate warmup of 1,000 steps at the start of training and after every time that neurons are resampled. ... For probe training details: we train a linear classification head via logistic regression, using the AdamW optimizer (Loshchilov & Hutter, 2017) and learning rate 0.01 for one epoch on this dataset of activations.
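The probe recipe quoted in this row (logistic-regression head, AdamW, learning rate 0.01, one epoch) can be sketched on toy data. The dataset here is a synthetic stand-in for cached LM activations, and the hand-rolled AdamW update is a minimal illustration, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for cached LM activations: linearly separable toy data.
X = rng.normal(size=(2048, 16))
w_true = rng.normal(size=16)
y = (X @ w_true > 0).astype(float)

def train_probe(X, y, lr=0.01, weight_decay=0.01, batch=32):
    """One epoch of logistic regression with a minimal AdamW update."""
    n, d = X.shape
    theta = np.zeros(d + 1)                      # weights + bias
    m, v, t = np.zeros(d + 1), np.zeros(d + 1), 0
    b1, b2, eps = 0.9, 0.999, 1e-8
    for i in range(0, n, batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        p = 1.0 / (1.0 + np.exp(-(xb @ theta[:-1] + theta[-1])))
        # Gradient of mean binary cross-entropy w.r.t. weights and bias.
        g = np.append(xb.T @ (p - yb) / len(xb), (p - yb).mean())
        t += 1
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        theta -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        theta[:-1] -= lr * weight_decay * theta[:-1]   # decoupled decay (weights only)
    return theta

theta = train_probe(X, y)
acc = (((X @ theta[:-1] + theta[-1]) > 0).astype(float) == y).mean()
```

The decoupled weight-decay step applied outside the Adam moment update is what distinguishes AdamW from plain Adam with L2 regularization.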