MIB: A Mechanistic Interpretability Benchmark

Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons, and increases our confidence that there has been real progress in the field.
Researcher Affiliation | Collaboration | 1 Boston University, 2 Pr(AI)2R Group, 3 Ai2, 4 Technion, 5 University of Buenos Aires, 6 Northeastern University, 7 Brown University, 8 University of Amsterdam, 9 Stanford University, 10 Independent, 11 MIT, 12 Cambridge University, 13 ETH Zürich. Correspondence to: Aaron Mueller <EMAIL>.
Pseudocode | No | The paper includes a figure (Figure 1) that provides an overview of MIB with a flowchart-like diagram, but it does not contain any structured pseudocode or algorithm blocks. Methods are described textually.
Open Source Code | Yes | Code is available on GitHub.
Open Datasets | Yes | Our datasets are available on Hugging Face. Code is available on GitHub. The leaderboard is hosted at this URL.
Dataset Splits | Yes | The number of instances in each dataset and split is summarized in Table 5 (App. D). Each task comes with a training split on which users can discover circuits or causal variables, and a validation split on which users can tune their methods or hyperparameters. We also create two test sets per task: public and private.
Hardware Specification | Yes | The final model was trained for 70 hours on a single H100 GPU.
Software Dependencies | No | The paper mentions specific tools and libraries, such as the pyvene library (Wu et al., 2024), Gemma Scope (Lieberum et al., 2024), and Llama Scope (He et al., 2024), but it does not provide version numbers for these or for any other ancillary software.
Experiment Setup | Yes | The learning rate used across models and tasks was 0.01, except for IOI, for which we used a learning rate of 1.0. No regularization loss terms were used. Epochs and batch size: for RAVEL, we train for one epoch on 30k examples with a batch size of 128 for Llama and 32 for Gemma. For MCQA, we train for 8 epochs on 300 examples with a batch size of 64. For ARC (Easy), we train for 2 epochs on 9k examples with a batch size of 48 for Gemma, and for 1 epoch with a batch size of 16 for Llama. For the two-digit addition task, we train for 1 epoch on 30k examples with a batch size of 256. For IOI, we train for one epoch on 30k examples. DAS dimensionality: the DAS dimensionality was set to 16 for the ordering-ID variable XOrder and the carry-the-one variable XCarry, and to 32 for STok and SPos. The OAnswer variable in MCQA and ARC (Easy) has a DAS dimensionality of half the residual stream of the respective model, because token embeddings live in a higher-dimensional space. For the RAVEL task, the dimensionality was an eighth of the residual stream, following the experiments of Huang et al. (2024a). Masking parameters: for the masking methods, the temperature schedule begins at 1.0 and approaches 0.01.
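A minimal sketch collecting the quoted hyperparameters into one place, plus one plausible form for the masking temperature schedule. The names (`train_config`, `das_dim`, `mask_temperature`) and the exponential-decay form are illustrative assumptions; the paper states only the start (1.0) and end (0.01) temperatures, not the decay shape.

```python
# Hyperparameters as quoted in the paper's experiment setup (per task).
# Dict layout and key names are assumptions for illustration.
train_config = {
    "RAVEL":    {"epochs": 1, "examples": 30_000, "lr": 0.01,
                 "batch_size": {"llama": 128, "gemma": 32}},
    "MCQA":     {"epochs": 8, "examples": 300, "lr": 0.01, "batch_size": 64},
    "ARC-easy": {"examples": 9_000, "lr": 0.01,
                 "epochs": {"gemma": 2, "llama": 1},
                 "batch_size": {"gemma": 48, "llama": 16}},
    "addition": {"epochs": 1, "examples": 30_000, "lr": 0.01, "batch_size": 256},
    "IOI":      {"epochs": 1, "examples": 30_000, "lr": 1.0},
}

# DAS intervention dimensionalities; OAnswer is d_model // 2 and
# RAVEL uses d_model // 8, where d_model is the model's residual width.
das_dim = {"XOrder": 16, "XCarry": 16, "STok": 32, "SPos": 32}

def mask_temperature(step, total_steps, t_start=1.0, t_end=0.01):
    """Anneal the mask temperature from t_start toward t_end.

    Exponential decay is an assumed schedule shape; the paper only
    gives the endpoints (starts at 1.0, approaches 0.01).
    """
    return t_start * (t_end / t_start) ** (step / total_steps)
```

At the start of training the temperature is 1.0 (soft, near-continuous masks); by the final step it reaches 0.01, making the learned masks effectively binary.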