Open Problems in Mechanistic Interpretability

Authors: Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Researcher Affiliation | Collaboration | Lee Sharkey (Apollo Research); Bilal Chughtai (Apollo Research); Joshua Batson (Anthropic); Jack Lindsey (Anthropic); Jeff Wu (Anthropic); Lucius Bushnaq (Apollo Research); Nicholas Goldowsky-Dill (Apollo Research); Stefan Heimersheim (Apollo Research); Alejandro Ortega (Apollo Research); Joseph Bloom (Decode Research); Stella Biderman (EleutherAI); Adrià Garriga-Alonso (FAR AI); Arthur Conmy (Google DeepMind); Neel Nanda (Google DeepMind); Jessica Rumbelow (Leap Laboratories); Martin Wattenberg (Harvard University); Nandi Schoots (King's College London and Imperial College London); Joseph Miller (MATS); William Saunders (METR); Eric J. Michaud (MIT); Stephen Casper (MIT); Max Tegmark (MIT); David Bau (Northeastern University); Eric Todd (Northeastern University); Atticus Geiger (Pr(Ai)²R Group); Mor Geva (Tel Aviv University); Jesse Hoogland (Timaeus); Daniel Murfet (University of Melbourne); Tom McGrath (Goodfire)
Pseudocode | No | The paper describes methods conceptually and illustrates them in figures (e.g., Figure 2, Figure 3), but it contains no formal pseudocode or algorithm blocks.
Open Source Code | No | The paper makes no explicit statement about releasing source code and provides no link to a code repository.
Open Datasets | No | As a review of open problems in mechanistic interpretability, the paper conducts no experiments of its own and uses no datasets for empirical validation, so it provides no concrete dataset access information.
Dataset Splits | No | The paper is a review and conducts no experiments of its own, so it provides no dataset split information.
Hardware Specification | No | The paper is a review of open problems and performs no experiments, so no hardware specifications are provided.
Software Dependencies | No | The paper is a review and does not describe a new methodology that would require specific software dependencies or version numbers for replication.
Experiment Setup | No | As a forward-facing review of open problems, the paper presents no experimental results of its own and therefore includes no experimental setup or hyperparameter details.