Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mechanistic Interpretability for AI Safety - A Review

Authors: Leonard Bereska, Stratis Gavves

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety.
Researcher Affiliation | Academia | Leonard Bereska, Efstratios Gavves (EMAIL), University of Amsterdam
Pseudocode | Yes | Sparse autoencoders (Cunningham et al., 2024) are proposed as a solution to polysemantic neurons. The problem of superposition is mathematically formalized as a sparse dictionary learning (Olshausen & Field, 1997) problem: decomposing neural network activations into disentangled component features. The goal is to learn a dictionary of vectors $\{f_k\}_{k=1}^{n_{\text{feat}}} \subset \mathbb{R}^d$ that can represent the unknown, ground-truth network features as sparse linear combinations. If successful, the learned dictionary contains monosemantic neurons corresponding to features (Sharkey et al., 2022b). The autoencoder architecture consists of a linear encoder with a ReLU activation function, expanding the input dimensionality to $d_{\text{hid}} > d_{\text{in}}$. The encoder's output is given by $h = \mathrm{ReLU}(W_{\text{enc}} x + b)$ (1), and the reconstruction by $\hat{x} = W_{\text{dec}} h = \sum_i h_i f_i$ (2), where $W_{\text{enc}}, W_{\text{dec}} \in \mathbb{R}^{d_{\text{hid}} \times d_{\text{in}}}$ and $b \in \mathbb{R}^{d_{\text{hid}}}$. The parameter matrix $W_{\text{dec}}$ forms the feature dictionary, with rows $f_i$ as dictionary features. The autoencoder is trained to minimize the loss $\mathcal{L}(x) = \|x - \hat{x}\|_2^2 + \alpha \|h\|_1$ (3), where the $L_1$ penalty on $h$ encourages sparse reconstructions using the dictionary features.
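The sparse-autoencoder formulation in Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from the review or the cited works; the class name, initialization scheme, and penalty weight are assumptions, and training (gradient descent on the loss) is omitted.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


class SparseAutoencoder:
    """Minimal sketch of the sparse autoencoder in Eqs. (1)-(3).

    Hypothetical illustration: names, initialization, and alpha
    are assumptions, not taken from the reviewed paper.
    """

    def __init__(self, d_in, d_hid, alpha=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        # Encoder weights and bias; d_hid > d_in (overcomplete dictionary).
        self.W_enc = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_hid, d_in))
        self.b = np.zeros(d_hid)
        # Decoder weights: row i is dictionary feature f_i in R^{d_in}.
        self.W_dec = rng.normal(0.0, 1.0 / np.sqrt(d_hid), (d_hid, d_in))
        self.alpha = alpha  # L1 sparsity weight

    def encode(self, x):
        # Eq. (1): h = ReLU(W_enc x + b)
        return relu(self.W_enc @ x + self.b)

    def decode(self, h):
        # Eq. (2): x_hat = sum_i h_i f_i (rows of W_dec are the f_i)
        return h @ self.W_dec

    def loss(self, x):
        # Eq. (3): squared reconstruction error plus L1 sparsity penalty
        h = self.encode(x)
        x_hat = self.decode(h)
        return np.sum((x - x_hat) ** 2) + self.alpha * np.sum(np.abs(h))
```

In a training loop, Eq. (3) would be minimized by gradient descent over `W_enc`, `W_dec`, and `b`; the `alpha` term trades reconstruction fidelity against sparsity of the codes `h`.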
Open Source Code | No | The paper is a review of mechanistic interpretability research and does not present new methodology with associated code. The text refers to an HTML version of the paper itself, not source code for new methods.
Open Datasets | No | This paper is a review and does not present its own experiments or datasets. It refers to datasets used in other research (e.g., 'chess' and 'Othello' transcripts, the 'CIFAR-10' standard split) but does not provide access information for datasets used in its own work.
Dataset Splits | No | This paper is a review and does not present its own experiments or dataset splits. It mentions other research using 'standard benchmark splits' but does not specify splits for any experiments conducted within this paper.
Hardware Specification | No | This paper is a review and does not report experimental results obtained on specific hardware. Therefore, no hardware specifications are provided.
Software Dependencies | No | This paper is a review and does not describe a novel methodology requiring specific software dependencies and versions for replication. It mentions various tools and frameworks used in the field of mechanistic interpretability generally but does not list software dependencies with version numbers for its own work.
Experiment Setup | No | This paper is a review and does not present its own experiments. Therefore, no experimental setup details, including hyperparameters or training configurations, are provided.