Scaling Sparse Feature Circuits For Studying In-Context Learning

Authors: Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TVC against four baseline approaches... We conducted extensive parameter sweeps... To validate the causal relevance of our decomposed task features, we conducted a series of steering experiments... To evaluate our SFC modifications, we measured faithfulness through ablation studies on our ICL task dataset.
Researcher Affiliation | Academia | ETH Zurich, Switzerland; Georgia Institute of Technology, US
Pseudocode | Yes | Algorithm 1: Pseudocode for Task Vector Cleaning. Algorithm 2: Pseudocode for Sparse Feature Circuits indirect-effect calculation.
Open Source Code | No | We also plan to share our SAE training codebase in JAX, with a full suite of SAEs for Gemma 1 2B, after paper publication. Our SAEs and training code will be made public after paper publication.
Open Datasets | Yes | Our dataset for circuit finding is primarily derived from the function vectors paper (Todd et al., 2024), which provides a diverse set of tasks for evaluating the existence and properties of function vectors in language models. We train residual and attention-output SAEs as well as transcoders for layers 1-18 of the model on FineWeb (Penedo et al., 2024).
Dataset Splits | Yes | The cleaning process is performed on a training batch of 24 pairs, with evaluation conducted on an additional 24 pairs. All prompts are zero-shot. Our methodology employed zero-shot prompts for task-execution features, measuring effects across a batch of 32 random pairs.
Hardware Specification | Yes | We use 4 v4 TPU chips running JAX (Bradbury et al., 2018) with Equinox (Kidger & Garcia, 2021) to train our SAEs. This is about 1 week of v4-8 TPU time.
Software Dependencies | No | The paper mentions software tools such as JAX, Equinox, Hugging Face's Flax LM implementations, and Penzai, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | Our Gemma 1 2B SAEs are trained with a learning rate of 1e-3 and Adam betas of 0.0 and 0.99 for 150M (~100) tokens of FineWeb (Penedo et al., 2024). We used a learning rate of 0.15 with the Gemma 1 2B, Phi-3, and Gemma 2 2B 65k models, 0.3 with Gemma 2 2B 16k, and 0.05 with 200 early-stopping steps for Gemma 2 9B. We established an optimal steering scale of 15, which we then applied consistently across all subsequent experiments.
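The table's Pseudocode row mentions an indirect-effect calculation for Sparse Feature Circuits (Algorithm 2 in the paper). The paper's own algorithm is not reproduced here, but the standard attribution-patching approximation used in the SFC literature estimates each feature's indirect effect as the gradient of a downstream metric times the clean-to-patch activation difference. A minimal sketch, in which `metric` is a hypothetical stand-in for the model's downstream metric (e.g., a logit difference):

```python
import jax
import jax.numpy as jnp

def metric(acts):
    # Hypothetical downstream metric as a linear function of SAE feature
    # activations; a stand-in for the real model forward pass.
    w = jnp.array([0.5, -1.0, 2.0])
    return jnp.dot(w, acts)

def attribution_ie(metric_fn, a_clean, a_patch):
    """Attribution-patching estimate of per-feature indirect effect:
    (a_patch - a_clean) * d metric / d a, evaluated at a_clean."""
    grads = jax.grad(metric_fn)(a_clean)
    return (a_patch - a_clean) * grads

a_clean = jnp.array([1.0, 0.0, 0.5])
a_patch = jnp.array([0.0, 1.0, 0.5])
ie = attribution_ie(metric, a_clean, a_patch)  # per-feature IE estimates
```

For a linear metric, as here, the first-order approximation is exact; for a real model it is a cheap one-backward-pass estimate of the effect of patching each feature individually.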
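The Experiment Setup row reports Adam with betas of 0.0 and 0.99 for SAE training. Setting b1 = 0.0 disables the first-moment (momentum) average, reducing Adam to a bias-corrected RMSProp-style update. A toy single-parameter sketch of one such step (the actual training used JAX/Equinox; this NumPy version is illustrative only):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.0, b2=0.99, eps=1e-8):
    # Betas as reported for the Gemma 1 2B SAE runs: with b1=0.0 the
    # first-moment estimate m is just the current gradient.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)   # bias correction (no-op when b1=0)
    v_hat = v / (1 - b2**t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

p, m, v = adam_step(np.array([1.0]), np.array([0.5]), np.zeros(1), np.zeros(1), t=1)
```

On the first step the bias-corrected update magnitude is approximately the learning rate itself, regardless of gradient scale, so p moves from 1.0 to roughly 0.999 here.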