Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Authors: Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments reveal overwhelming evidence of non-identifiability in all cases: multiple circuits can replicate model behavior, multiple interpretations can exist for a circuit, several algorithms can be causally aligned with the neural network, and a single algorithm can be causally aligned with different subspaces of the network. We stress-test the identifiability properties of current MI criteria by conducting experiments in a controlled, small-scale setting. Using simple tasks like learning Boolean functions and very small multi-layer perceptrons (MLPs), we search for Boolean circuit explanations aiming to discover which succession of logic gates is implemented by the MLPs.
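The gate-identification step described above — deciding which logic gate a small network's binary input-output behavior corresponds to — can be sketched by matching truth tables. This is an illustrative sketch, not the authors' code; the gate names and the toy "network" are assumptions for demonstration.

```python
from itertools import product

# A few of the 16 possible 2-input Boolean gates, keyed by their
# truth table over inputs (0,0), (0,1), (1,0), (1,1).
GATES = {
    (0, 0, 0, 1): "AND",
    (0, 1, 1, 1): "OR",
    (0, 1, 1, 0): "XOR",
    (1, 1, 1, 0): "NAND",
    (1, 0, 0, 0): "NOR",
    (1, 0, 0, 1): "XNOR",
}

def truth_table(f):
    """Evaluate a binary function on all four 2-bit inputs."""
    return tuple(f(a, b) for a, b in product((0, 1), repeat=2))

def identify_gate(f):
    """Name the gate matching f's truth table, if it is a named one."""
    return GATES.get(truth_table(f), "unnamed")

# A toy thresholded 'network': summing and thresholding behaves like OR.
net = lambda a, b: int(a + b >= 1)
print(identify_gate(net))  # OR
```

In the paper's setting the mapping runs the other way as well: several distinct gate assignments can be causally aligned with the same network, which is the non-identifiability the experiments document.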
Researcher Affiliation Academia Maxime Méloux, François Portet, Silviu Maniu, Maxime Peyrard Université Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France EMAIL
Pseudocode No The paper describes methodologies (where-then-what and what-then-where) verbally and through definitions, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes The code and parameters used to conduct this paper's experiments can be found on GitHub.
Open Datasets Yes For example, we trained a larger MLP on a subset of the MNIST dataset (Deng, 2012), filtered to contain only the digits 0 and 1.
Dataset Splits No The paper mentions using a subset of the MNIST dataset filtered to digits 0 and 1, and using binary samples with Gaussian noise for Boolean functions. It does not provide specific training, validation, or test split percentages or counts for any dataset.
Hardware Specification No It was granted access to the HPC resources of IDRIS under the allocation 2025-AD011014834 made by GENCI.
Software Dependencies No The paper does not explicitly mention any specific software dependencies or their version numbers.
Experiment Setup Yes The MLP is trained on binary inputs with a single logit output to produce the XOR behavior. The inputs are 0 or 1 and can have a randomly sampled Gaussian noise of a fixed standard deviation. [...] training is performed on binary samples with added Gaussian noise and continues until the network mean squared loss is lower than εₙ = 10⁻³. [...] We choose n 2-input logic gates L1, . . . , Ln, generate a multilayer perceptron (MLP) N with layer sizes (2, k, k, n), and train N to implement the gates L1, . . . , Ln. [...] We obtained a regression model with layer sizes (784, 128, 128, 3, 3, 3, 1).
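The quoted setup — an MLP with layer sizes (2, k, k, 1) trained on noisy binary XOR samples until the mean squared loss drops below 10⁻³ — can be sketched as below. This is a minimal numpy sketch under assumed hyperparameters (k = 8, noise σ = 0.05, learning rate 0.1, full-batch gradient descent), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy binary XOR dataset: 0/1 inputs plus Gaussian noise of fixed
# standard deviation, labels computed from the clean inputs.
sigma = 0.05
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X0 = np.tile(base, (64, 1))                       # 256 clean samples
y = (X0[:, 0] != X0[:, 1]).astype(float)[:, None]  # XOR labels
X = X0 + rng.normal(0.0, sigma, size=X0.shape)

# MLP with layer sizes (2, k, k, 1): tanh hidden layers, linear logit.
k, lr = 8, 0.1
W1 = rng.normal(0, 0.5, (2, k)); b1 = np.zeros(k)
W2 = rng.normal(0, 0.5, (k, k)); b2 = np.zeros(k)
W3 = rng.normal(0, 0.5, (k, 1)); b3 = np.zeros(1)

def forward(X):
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h1, h2, h2 @ W3 + b3

init_loss = np.mean((forward(X)[2] - y) ** 2)
for step in range(5000):
    h1, h2, out = forward(X)
    loss = np.mean((out - y) ** 2)
    if loss < 1e-3:          # stopping criterion from the setup above
        break
    # Manual backpropagation of the MSE loss.
    d_out = 2 * (out - y) / len(X)
    dW3 = h2.T @ d_out
    d_h2 = (d_out @ W3.T) * (1 - h2 ** 2)
    dW2 = h1.T @ d_h2
    d_h1 = (d_h2 @ W2.T) * (1 - h1 ** 2)
    dW1 = X.T @ d_h1
    W3 -= lr * dW3; b3 -= lr * d_out.sum(0)
    W2 -= lr * dW2; b2 -= lr * d_h2.sum(0)
    W1 -= lr * dW1; b1 -= lr * d_h1.sum(0)

final_loss = np.mean((forward(X)[2] - y) ** 2)
```

The paper's identifiability question then asks whether the Boolean-circuit explanation recovered from such a trained network is unique; the experiments find it generally is not.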