Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct our main experiments with the Gemma 2 2B model (Gemma Team, 2024) and the Gemma Scope SAE pack (Lieberum et al., 2024)... We design our experiments to analyze how residual features emerge, propagate, and can be manipulated across model layers. Specifically, we aim to: (i) determine how features originate in different model components, (ii) assess whether deactivating a predecessor feature truly deactivates its descendant, and (iii) use these insights to steer the model's generation toward or away from specific topics. Below is a concise summary of each experiment; see Appendices A and B for the detailed setup. Identification of feature predecessors: we first verify that the cosine-similarity relations used for single-layer analysis align with actual activation correlations. A target feature in the residual stream R^L is matched with features from the previous residual R^(L-1), the MLP output M, or the attention output A. If none of these are active, we label the feature "from nowhere". By applying this process to four diverse datasets, we confirm the above-stated relation, and we also analyze how these groups are distributed across layers. |
| Researcher Affiliation | Collaboration | Daniil Laptev (1,2), Nikita Balagansky (1,2), Yaroslav Aksenov (1), Daniil Gavrilov (1); (1) T-Tech, (2) Moscow Institute of Physics and Technology. Correspondence to: Nikita Balagansky <EMAIL>. |
| Pseudocode | No | The paper describes the methodology for cross-layer feature evolution, mechanistic properties of flow graphs, and multi-layer model steering in paragraph text (Sections 3.2, 3.3, 3.4, 3.5) without presenting structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository. It mentions using the 'Gemma Scope SAE pack' and 'Llama Scope', which are third-party resources. |
| Open Datasets | Yes | We use four datasets for this analysis: FineWeb (Penedo et al., 2024) (general-purpose texts), TinyStories (Eldan & Li, 2023) (short synthetic stories), AutoMathText (Zhang et al., 2024) (math-related texts), and Python GitHub Code (pure Python code). |
| Dataset Splits | Yes | From each dataset, we select 250 random samples; for each sample, we pick 5 random tokens (excluding the BOS token). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using the 'Gemma 2 2B model' and 'Gemma Scope SAE pack' and refers to 'Python Github Code' and 'Neuronpedia' for interpretations, but it does not specify version numbers for Python, any deep learning frameworks, or other key software libraries. |
| Experiment Setup | Yes | We use the prompt "I think that the biggest problem of contemporary theoretical physics is" and generate text with a maximum length of 96 tokens, top-p = 0.7, and temperature T = 1.27. To determine whether each theme is present in the generated text, we query a gpt-4o-mini language model for a score from 0 to 5 on each theme, following an approach similar to Chalnev et al. (2024). We set α = 0.05 and s = 1, based on generating a small batch of test completions and manually checking the trade-off between coherence and theme intensity. After that, we steer the resulting features with manually obtained s = 8 and α = 0.05 for the single-layer case, and s = 3 and α = 0.25 with the exponential decrease method for the cumulative setting. |
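The predecessor-identification step quoted above (matching a target residual feature against features from the previous residual, the MLP output, or the attention output, and labeling it "from nowhere" when no candidate is active) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' code: function and variable names are hypothetical, and the activation threshold is a placeholder.

```python
# Hypothetical sketch of matching a target SAE feature to a predecessor
# by cosine similarity of decoder directions, restricted to candidates
# that are actually active on the same token. Names/shapes are assumptions.
import numpy as np

def match_predecessor(target_dir, candidate_dirs, candidate_acts, threshold=0.0):
    """Match a target residual feature at layer L to its best predecessor.

    target_dir:     (d,)   decoder direction of the target feature (R^L)
    candidate_dirs: (n, d) decoder directions of candidate features drawn
                    from R^(L-1), the MLP output M, or the attention output A
    candidate_acts: (n,)   SAE activations of the candidates on the token
    Returns (index, cosine) of the most similar active candidate, or None
    when no candidate is active -- the "from nowhere" case.
    """
    # Normalize so the dot product is a cosine similarity.
    t = target_dir / np.linalg.norm(target_dir)
    c = candidate_dirs / np.linalg.norm(candidate_dirs, axis=1, keepdims=True)
    sims = c @ t
    active = candidate_acts > threshold
    if not active.any():
        return None  # no active predecessor: feature appears "from nowhere"
    # Mask out inactive candidates, then take the most similar one.
    sims = np.where(active, sims, -np.inf)
    best = int(np.argmax(sims))
    return best, float(sims[best])
```

In the paper's setup this matching is repeated across layers and datasets to build the flow graphs; the sketch shows only the single-token, single-layer decision.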
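The steering setup in the last row mentions a strength s, a parameter α, and an "exponential decrease method" for the cumulative (multi-layer) setting. A common way to steer with an SAE feature is to add a scaled copy of its decoder direction to the residual stream; the sketch below assumes that mechanism and a (1 − α)-per-layer decay schedule. Both the decay form and the function names are assumptions for illustration, not the paper's stated implementation.

```python
# Hypothetical sketch of multi-layer steering with exponentially
# decreasing strength. The decay schedule s * (1 - alpha)^k is an
# assumption; the paper only names an "exponential decrease method".
import numpy as np

def steering_scales(s, alpha, start_layer, n_layers):
    """Per-layer steering strengths: full strength s at start_layer,
    decayed by a factor (1 - alpha) at each subsequent layer."""
    scales = np.zeros(n_layers)
    for layer in range(start_layer, n_layers):
        scales[layer] = s * (1.0 - alpha) ** (layer - start_layer)
    return scales

def apply_steering(residual, feature_dir, scale):
    """Add a scaled SAE decoder direction to the residual stream.

    residual:    (seq_len, d_model) residual activations at one layer
    feature_dir: (d_model,)         decoder direction of the steered feature
    """
    return residual + scale * feature_dir
```

With the paper's cumulative values (s = 3, α = 0.25), this schedule would apply strength 3 at the starting layer, 2.25 at the next, and so on, tapering the intervention instead of injecting the full vector at every layer.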