Scaling Sparse Feature Circuits For Studying In-Context Learning
Authors: Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TVC against four baseline approaches... We conducted extensive parameter sweeps... To validate the causal relevance of our decomposed task features, we conducted a series of steering experiments... To evaluate our SFC modifications, we measured faithfulness through ablation studies on our ICL task dataset. |
| Researcher Affiliation | Academia | 1ETH Zurich, Switzerland 2Georgia Institute of Technology, US |
| Pseudocode | Yes | Algorithm 1. Pseudocode for Task Vector Cleaning. Algorithm 2. Pseudocode for Sparse Feature Circuits indirect effect calculation. |
| Open Source Code | No | We will also plan to share SAE training codebase in JAX with a full suite of SAEs for Gemma 1 2B after the paper publication. Our SAEs and training code will be made public after paper publication. |
| Open Datasets | Yes | Our dataset for circuit finding is primarily derived from the function vectors paper Todd et al. (2024), which provides a diverse set of tasks for evaluating the existence and properties of function vectors in language models. We train residual and attention output SAEs as well as transcoders for layers 1-18 of the model on Fine Web Penedo et al. (2024). |
| Dataset Splits | Yes | The cleaning process is performed on a training batch of 24 pairs, with evaluation conducted on an additional 24 pairs. All prompts are zero-shot. Our methodology employed zero-shot prompts for task-execution features, measuring effects across a batch of 32 random pairs. |
| Hardware Specification | Yes | We use 4 v4 TPU chips running Jax Bradbury et al. (2018) (Equinox Kidger & Garcia (2021)) to train our SAEs. This is about 1 week of v4-8 TPU time. |
| Software Dependencies | No | The paper mentions software tools such as JAX, Equinox, Hugging Face's Flax LM implementations, and Penzai, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Our Gemma 1 2B SAEs are trained with a learning rate of 1e-3 and Adam betas of 0.0 and 0.99 for 150M (~100) tokens of Fine Web Penedo et al. (2024). We used a learning rate of 0.15 with the Gemma 1 2B, Phi-3, and Gemma 2 2B 65k models, 0.3 with Gemma 2 2B 16k, and 0.05 with 200 early stopping steps for Gemma 2 9B. We established an optimal steering scale of 15, which we then applied consistently across all subsequent experiments. |