Mechanistic Permutability: Match Features Across Layers

Authors: Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers.
Researcher Affiliation | Collaboration | Nikita Balagansky (T-Tech; Moscow Institute of Physics and Technology), Ian Maksimov (HSE University; T-Tech), Daniil Gavrilov (T-Tech)
Pseudocode | No | The paper describes its methods using mathematical formulas and conceptual descriptions, such as equations (1), (2), and (4), and figures like Figure 2, which illustrates the folding process. However, it contains no clearly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a direct link to a code repository for the 'SAE Match' methodology described. It refers to open-sourced SAEs from other works (Lieberum et al., 2024) and datasets, but not its specific implementation code.
Open Datasets | Yes | "We tested our methods on subsets of Open Web Text (Gokaslan et al., 2019), Code (https://huggingface.co/datasets/loubnabnl/github-small-near-dedup), and WikiText (Merity et al., 2016). From each dataset, we randomly sampled 100 examples, truncated them to 1,024 tokens, and excluded the beginning-of-sequence (BOS) token when calculating metrics."
Dataset Splits | No | "From each dataset, we randomly sampled 100 examples, truncated them to 1,024 tokens, and excluded the beginning-of-sequence (BOS) token when calculating metrics." This describes a sampling strategy but does not specify training, validation, or test splits for models or evaluation.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'GPT-4o' and 'GPT-4o mini' for external LLM evaluation, but these are models/services, not specific software dependencies with version numbers for the core methodology's implementation (e.g., Python, PyTorch versions). No other software dependencies with version numbers are listed.
Experiment Setup | Yes | "For matching we used MSE from both decoder and encoder layers. During our initial experiments, we observed that the decoder-only option performs similarly to our scheme, while the encoder-only suffers from poor quality of matching (see Appendix Figure 12 for comparison). Each experiment involved approximately 1,600 LLM evaluations over 100 feature paths spanning 16 layers (details in Appendix Section C). We tested our methods on subsets of Open Web Text (Gokaslan et al., 2019), Code (https://huggingface.co/datasets/loubnabnl/github-small-near-dedup), and WikiText (Merity et al., 2016). From each dataset, we randomly sampled 100 examples, truncated them to 1,024 tokens, and excluded the beginning-of-sequence (BOS) token when calculating metrics."
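The quoted evaluation-data protocol (100 sampled examples per dataset, truncation to 1,024 tokens, BOS excluded from metrics) can be sketched as follows. This is a hypothetical illustration, not the authors' code: `BOS_ID`, the seed, and the toy corpus are stand-ins for the real tokenizer and datasets.

```python
import random

# Sketch of the sampling protocol: draw 100 documents per dataset,
# truncate each to 1,024 tokens, and strip the leading BOS token
# before metrics are computed. BOS_ID is an assumed placeholder.

BOS_ID = 2
MAX_LEN = 1024
N_SAMPLES = 100

def prepare_eval_tokens(corpus, seed=0):
    """Sample documents, truncate, and drop the leading BOS token."""
    rng = random.Random(seed)
    docs = rng.sample(corpus, min(N_SAMPLES, len(corpus)))
    prepared = []
    for tokens in docs:
        tokens = tokens[:MAX_LEN]            # truncate to 1,024 tokens
        if tokens and tokens[0] == BOS_ID:   # exclude BOS from metrics
            tokens = tokens[1:]
        prepared.append(tokens)
    return prepared

# Toy corpus: 150 "documents", each beginning with BOS.
corpus = [[BOS_ID] + [(i + j) % 1000 + 3 for j in range(1500)]
          for i in range(150)]
batches = prepare_eval_tokens(corpus)
print(len(batches), max(len(b) for b in batches))  # 100 1023
```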
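The matching objective described in the setup (MSE from both decoder and encoder layers) can be illustrated with a small sketch: features of one layer's SAE are assigned one-to-one to features of the next layer's SAE by minimizing the summed decoder- and encoder-weight MSE, treated here as a linear assignment problem. The random weights, dimensions, and use of the Hungarian algorithm are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch: match layer-l SAE features to layer-(l+1) features by
# minimizing summed MSE over decoder and encoder weight vectors.
# Random weights below are stand-ins for real trained SAEs.

rng = np.random.default_rng(0)
d_model, n_feat = 64, 128

dec_a, dec_b = rng.normal(size=(2, n_feat, d_model))  # per-feature decoder rows
enc_a, enc_b = rng.normal(size=(2, n_feat, d_model))  # per-feature encoder rows

def pairwise_mse(a, b):
    """MSE between every row of `a` and every row of `b` -> (n_feat, n_feat)."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).mean(axis=-1)

cost = pairwise_mse(dec_a, dec_b) + pairwise_mse(enc_a, enc_b)
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
# cols[i] is the layer-(l+1) feature matched to layer-l feature i.
print(sorted(cols.tolist()) == list(range(n_feat)))  # True: a permutation
```

The assignment returns a permutation of feature indices, which is the sense in which features can be "matched" across layers.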