Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Authors: Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal overwhelming evidence of non-identifiability in all cases: multiple circuits can replicate model behavior, multiple interpretations can exist for a circuit, several algorithms can be causally aligned with the neural network, and a single algorithm can be causally aligned with different subspaces of the network. We stress-test the identifiability properties of current MI criteria by conducting experiments in a controlled, small-scale setting. Using simple tasks like learning Boolean functions and very small multi-layer perceptrons (MLPs), we search for Boolean circuit explanations aiming to discover which succession of logic gates is implemented by the MLPs. |
| Researcher Affiliation | Academia | Maxime Méloux, François Portet, Silviu Maniu, Maxime Peyrard Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France EMAIL |
| Pseudocode | No | The paper describes methodologies (where-then-what and what-then-where) verbally and through definitions, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and parameters used to conduct this paper's experiments can be found on GitHub. |
| Open Datasets | Yes | For example, we trained a larger MLP on a subset of the MNIST dataset (Deng, 2012), filtered to contain only the digits 0 and 1. |
| Dataset Splits | No | The paper mentions using a subset of the MNIST dataset filtered to digits 0 and 1, and using binary samples with Gaussian noise for Boolean functions. It does not provide specific training, validation, or test split percentages or counts for any dataset. |
| Hardware Specification | No | This work was granted access to the HPC resources of IDRIS under the allocation 2025-AD011014834 made by GENCI. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers. |
| Experiment Setup | Yes | The MLP is trained on binary inputs with a single logit output to produce the XOR behavior. The inputs are 0 or 1 and can have a randomly sampled Gaussian noise of a fixed standard deviation. [...] training is performed on binary samples with added Gaussian noise and continues until the network's mean squared loss is lower than 10⁻³. [...] We choose n 2-input logic gates L1, ..., Ln, generate a multilayer perceptron (MLP) N with layer sizes (2, k, k, n), and train N to implement the gates L1, ..., Ln. [...] We obtained a regression model with layer sizes (784, 128, 128, 3, 3, 3, 1). |
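The experiment setup quoted above can be sketched in code. The following is a minimal NumPy illustration, not the authors' released code: it trains an MLP with layer sizes (2, k, k, 1) on noisy binary inputs to implement XOR, stopping once the mean squared loss falls below 10⁻³. The hidden width k = 8, tanh activations, learning rate, noise standard deviation, and step cap are all assumptions for illustration.

```python
import numpy as np

# Sketch of the paper's XOR setup (assumed hyperparameters throughout):
# an MLP with layer sizes (2, k, k, 1), trained on binary inputs with
# added Gaussian noise until the mean squared loss drops below 1e-3.
rng = np.random.default_rng(0)
k, noise_std, lr = 8, 0.05, 0.5

# Simple scaled-Gaussian initialization for the three weight matrices.
W1 = rng.normal(0, 1 / np.sqrt(2), (2, k)); b1 = np.zeros(k)
W2 = rng.normal(0, 1 / np.sqrt(k), (k, k)); b2 = np.zeros(k)
W3 = rng.normal(0, 1 / np.sqrt(k), (k, 1)); b3 = np.zeros(1)

X_clean = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])  # XOR targets

loss = 1.0
for step in range(50_000):
    # Binary samples with added Gaussian noise, as in the paper.
    X = X_clean + rng.normal(0, noise_std, X_clean.shape)
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    out = h2 @ W3 + b3              # single logit output
    err = out - y
    loss = float(np.mean(err ** 2))
    if loss < 1e-3:                 # stopping criterion from the paper
        break
    # Full-batch backpropagation through the two tanh layers.
    g_out = 2 * err / len(X)
    g_h2 = (g_out @ W3.T) * (1 - h2 ** 2)
    g_h1 = (g_h2 @ W2.T) * (1 - h1 ** 2)
    W3 -= lr * h2.T @ g_out; b3 -= lr * g_out.sum(0)
    W2 -= lr * h1.T @ g_h2;  b2 -= lr * g_h2.sum(0)
    W1 -= lr * X.T @ g_h1;   b1 -= lr * g_h1.sum(0)

# Evaluate on the clean binary inputs.
preds = np.tanh(np.tanh(X_clean @ W1 + b1) @ W2 + b2) @ W3 + b3
print(loss, preds.ravel())
```

The paper's MI search would then look for the succession of logic gates this trained network implements; the sketch only covers the training stage that produces the network being interpreted.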