Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct our main experiments with the Gemma 2 2B model (Gemma Team, 2024) and the Gemma Scope SAE pack (Lieberum et al., 2024)... We design our experiments to analyze how residual features emerge, propagate, and can be manipulated across model layers. Specifically, we aim to: (i) determine how features originate in different model components, (ii) assess whether deactivating a predecessor feature truly deactivates its descendant, and (iii) use these insights to steer the model's generation toward or away from specific topics. Below is a concise summary of each experiment; see Appendices A and B for the detailed setup. Identification of feature predecessors: we first verify that the cosine-similarity relations used for single-layer analysis align with actual activation correlations. A target feature in the residual stream R^L is matched with features from the previous residual R^(L-1), the MLP output M, or the attention output A. If none of these are active, we label the feature "from nowhere". By applying this process to four diverse datasets, we confirm the above-stated relation, and we also analyze how these groups are distributed across layers. |
| Researcher Affiliation | Collaboration | Daniil Laptev (1,2), Nikita Balagansky (1,2), Yaroslav Aksenov (1), Daniil Gavrilov (1); (1) T-Tech, (2) Moscow Institute of Physics and Technology. Correspondence to: Nikita Balagansky <EMAIL>. |
| Pseudocode | No | The paper describes the methodology for cross-layer feature evolution, mechanistic properties of flow graphs, and multi-layer model steering in paragraph text (Sections 3.2, 3.3, 3.4, 3.5) without presenting structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository. It mentions using the 'Gemma Scope SAE pack' and 'Llama Scope', which are third-party resources. |
| Open Datasets | Yes | We use four datasets for this analysis: FineWeb (Penedo et al., 2024) (general-purpose texts), TinyStories (Eldan & Li, 2023) (short synthetic stories), AutoMathText (Zhang et al., 2024) (math-related texts), and Python GitHub Code (pure Python code). |
| Dataset Splits | Yes | From each dataset, we select 250 random samples; for each sample, we pick 5 random tokens (excluding the BOS token). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using the 'Gemma 2 2B model' and 'Gemma Scope SAE pack' and refers to 'Python Github Code' and 'Neuronpedia' for interpretations, but it does not specify version numbers for Python, any deep learning frameworks, or other key software libraries. |
| Experiment Setup | Yes | We use the prompt "I think that the biggest problem of contemporary theoretical physics is" and generate text with a maximum length of 96 tokens, top-p = 0.7, and temperature T = 1.27. To determine whether each theme is present in the generated text, we query a gpt-4o-mini language model for a score from 0 to 5 on each theme, following an approach similar to Chalnev et al. (2024). We set α = 0.05 and s = 1, based on generating a small batch of test completions and manually checking the trade-off between coherence and theme intensity. After that, we steer the resulting features with manually obtained s = 8 and α = 0.05 for the single-layer case, and s = 3 and α = 0.25 with the exponential decrease method for the cumulative setting. |
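The predecessor-identification step quoted above (matching a target residual feature against features from the previous residual, the MLP output, or the attention output, and labeling it "from nowhere" when no candidate is active) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' code: function and variable names are hypothetical, and the activation threshold is a placeholder.

```python
# Hypothetical sketch of matching a target SAE feature to a predecessor
# by cosine similarity of decoder directions, restricted to candidates
# that are actually active on the same token. Names/shapes are assumptions.
import numpy as np

def match_predecessor(target_dir, candidate_dirs, candidate_acts, threshold=0.0):
    """Match a target residual feature at layer L to its best predecessor.

    target_dir:     (d,)   decoder direction of the target feature (R^L)
    candidate_dirs: (n, d) decoder directions of candidate features drawn
                    from R^(L-1), the MLP output M, or the attention output A
    candidate_acts: (n,)   SAE activations of the candidates on the token
    Returns (index, cosine) of the most similar active candidate, or None
    when no candidate is active -- the "from nowhere" case.
    """
    # Normalize so the dot product is a cosine similarity.
    t = target_dir / np.linalg.norm(target_dir)
    c = candidate_dirs / np.linalg.norm(candidate_dirs, axis=1, keepdims=True)
    sims = c @ t
    active = candidate_acts > threshold
    if not active.any():
        return None  # no active predecessor: feature appears "from nowhere"
    # Mask out inactive candidates, then take the most similar one.
    sims = np.where(active, sims, -np.inf)
    best = int(np.argmax(sims))
    return best, float(sims[best])
```

In the paper's setup this matching is repeated across layers and datasets to build the flow graphs; the sketch shows only the single-token, single-layer decision.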
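The steering setup in the last row mentions a strength s, a parameter α, and an "exponential decrease method" for the cumulative (multi-layer) setting. A common way to steer with an SAE feature is to add a scaled copy of its decoder direction to the residual stream; the sketch below assumes that mechanism and a (1 − α)-per-layer decay schedule. Both the decay form and the function names are assumptions for illustration, not the paper's stated implementation.

```python
# Hypothetical sketch of multi-layer steering with exponentially
# decreasing strength. The decay schedule s * (1 - alpha)^k is an
# assumption; the paper only names an "exponential decrease method".
import numpy as np

def steering_scales(s, alpha, start_layer, n_layers):
    """Per-layer steering strengths: full strength s at start_layer,
    decayed by a factor (1 - alpha) at each subsequent layer."""
    scales = np.zeros(n_layers)
    for layer in range(start_layer, n_layers):
        scales[layer] = s * (1.0 - alpha) ** (layer - start_layer)
    return scales

def apply_steering(residual, feature_dir, scale):
    """Add a scaled SAE decoder direction to the residual stream.

    residual:    (seq_len, d_model) residual activations at one layer
    feature_dir: (d_model,)         decoder direction of the steered feature
    """
    return residual + scale * feature_dir
```

With the paper's cumulative values (s = 3, α = 0.25), this schedule would apply strength 3 at the starting layer, 2.25 at the next, and so on, tapering the intervention instead of injecting the full vector at every layer.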