Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Authors: Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments reveal overwhelming evidence of non-identifiability in all cases: multiple circuits can replicate model behavior, multiple interpretations can exist for a circuit, several algorithms can be causally aligned with the neural network, and a single algorithm can be causally aligned with different subspaces of the network. We stress-test the identifiability properties of current MI criteria by conducting experiments in a controlled, small-scale setting. Using simple tasks like learning Boolean functions and very small multi-layer perceptrons (MLPs), we search for Boolean circuit explanations aiming to discover which succession of logic gates is implemented by the MLPs.
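The gate-identification step described above — deciding which logic gate a small network's binary input-output behavior corresponds to — can be sketched by matching truth tables. This is an illustrative sketch, not the authors' code; the gate names and the toy "network" are assumptions for demonstration.

```python
from itertools import product

# A few of the 16 possible 2-input Boolean gates, keyed by their
# truth table over inputs (0,0), (0,1), (1,0), (1,1).
GATES = {
    (0, 0, 0, 1): "AND",
    (0, 1, 1, 1): "OR",
    (0, 1, 1, 0): "XOR",
    (1, 1, 1, 0): "NAND",
    (1, 0, 0, 0): "NOR",
    (1, 0, 0, 1): "XNOR",
}

def truth_table(f):
    """Evaluate a binary function on all four 2-bit inputs."""
    return tuple(f(a, b) for a, b in product((0, 1), repeat=2))

def identify_gate(f):
    """Name the gate matching f's truth table, if it is a named one."""
    return GATES.get(truth_table(f), "unnamed")

# A toy thresholded 'network': summing and thresholding behaves like OR.
net = lambda a, b: int(a + b >= 1)
print(identify_gate(net))  # OR
```

In the paper's setting the mapping runs the other way as well: several distinct gate assignments can be causally aligned with the same network, which is the non-identifiability the experiments document.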
Researcher Affiliation Academia Maxime Méloux, François Portet, Silviu Maniu, Maxime Peyrard Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France EMAIL
Pseudocode No The paper describes methodologies (where-then-what and what-then-where) verbally and through definitions, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes The code and parameters used to conduct this paper's experiments can be found on GitHub.
Open Datasets Yes For example, we trained a larger MLP on a subset of the MNIST dataset (Deng, 2012), filtered to contain only the digits 0 and 1.
Dataset Splits No The paper mentions using a subset of the MNIST dataset filtered to digits 0 and 1, and using binary samples with Gaussian noise for Boolean functions. It does not provide specific training, validation, or test split percentages or counts for any dataset.
Hardware Specification No It was granted access to the HPC resources of IDRIS under the allocation 2025-AD011014834 made by GENCI.
Software Dependencies No The paper does not explicitly mention any specific software dependencies or their version numbers.
Experiment Setup Yes The MLP is trained on binary inputs with a single logit output to produce the XOR behavior. The inputs are 0 or 1 and can have a randomly sampled Gaussian noise of a fixed standard deviation. [...] training is performed on binary samples with added Gaussian noise and continues until the network mean squared loss is lower than εₙ = 10⁻³. [...] We choose n 2-input logic gates L1, . . . , Ln, generate a multilayer perceptron (MLP) N with layer sizes (2, k, k, n), and train N to implement the gates L1, . . . , Ln. [...] We obtained a regression model with layer sizes (784, 128, 128, 3, 3, 3, 1).
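The quoted setup — an MLP with layer sizes (2, k, k, 1) trained on noisy binary XOR samples until the mean squared loss drops below 10⁻³ — can be sketched as below. This is a minimal numpy sketch under assumed hyperparameters (k = 8, noise σ = 0.05, learning rate 0.1, full-batch gradient descent), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy binary XOR dataset: 0/1 inputs plus Gaussian noise of fixed
# standard deviation, labels computed from the clean inputs.
sigma = 0.05
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X0 = np.tile(base, (64, 1))                       # 256 clean samples
y = (X0[:, 0] != X0[:, 1]).astype(float)[:, None]  # XOR labels
X = X0 + rng.normal(0.0, sigma, size=X0.shape)

# MLP with layer sizes (2, k, k, 1): tanh hidden layers, linear logit.
k, lr = 8, 0.1
W1 = rng.normal(0, 0.5, (2, k)); b1 = np.zeros(k)
W2 = rng.normal(0, 0.5, (k, k)); b2 = np.zeros(k)
W3 = rng.normal(0, 0.5, (k, 1)); b3 = np.zeros(1)

def forward(X):
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h1, h2, h2 @ W3 + b3

init_loss = np.mean((forward(X)[2] - y) ** 2)
for step in range(5000):
    h1, h2, out = forward(X)
    loss = np.mean((out - y) ** 2)
    if loss < 1e-3:          # stopping criterion from the setup above
        break
    # Manual backpropagation of the MSE loss.
    d_out = 2 * (out - y) / len(X)
    dW3 = h2.T @ d_out
    d_h2 = (d_out @ W3.T) * (1 - h2 ** 2)
    dW2 = h1.T @ d_h2
    d_h1 = (d_h2 @ W2.T) * (1 - h1 ** 2)
    dW1 = X.T @ d_h1
    W3 -= lr * dW3; b3 -= lr * d_out.sum(0)
    W2 -= lr * dW2; b2 -= lr * d_h2.sum(0)
    W1 -= lr * dW1; b1 -= lr * d_h1.sum(0)

final_loss = np.mean((forward(X)[2] - y) ** 2)
```

The paper's identifiability question then asks whether the Boolean-circuit explanation recovered from such a trained network is unique; the experiments find it generally is not.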