Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Authors: Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction...
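The move from mechanism replacement to mechanism transformation described above can be illustrated on a toy structural causal model. The sketch below is hypothetical (the model, variable names, and helpers are not from the paper): a hard intervention discards a mechanism and substitutes a constant, while a mechanism transformation is a functional that maps the old mechanism to a new one.

```python
# Toy structural causal model X -> Y -> Z, with mechanisms stored as functions.
# Minimal hypothetical sketch, not the paper's formal definitions.

def solve(mechanisms):
    """Run mechanisms in topological order X, Y, Z and return the full setting."""
    vals = {}
    vals["X"] = mechanisms["X"](vals)
    vals["Y"] = mechanisms["Y"](vals)
    vals["Z"] = mechanisms["Z"](vals)
    return vals

base = {
    "X": lambda v: 2,
    "Y": lambda v: v["X"] + 1,
    "Z": lambda v: 2 * v["Y"],
}

# Hard intervention: replace Y's mechanism with the constant 10.
hard = dict(base, Y=lambda v: 10)

# Mechanism transformation: a functional from the old mechanism for Y
# to a new one (here, negating its output) rather than discarding it.
def negate(old):
    return lambda v: -old(v)

transformed = dict(base, Y=negate(base["Y"]))

print(solve(base)["Z"])         # 2 * (2 + 1) = 6
print(solve(hard)["Z"])         # 2 * 10 = 20
print(solve(transformed)["Z"])  # 2 * (-(2 + 1)) = -6
```

The hard intervention ignores Y's parents entirely; the transformation keeps the old mechanism as an argument, which is what makes it strictly more general.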
Researcher Affiliation | Academia | Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard; Pr(Ai)2R Group; Stanford University. Corresponding authors: EMAIL; EMAIL
Pseudocode | Yes | Algorithm 1: Scrub(b, 𝐇)
     2: for H ∈ 𝐇 do
     3:     if H ∈ X_In^L then
     4:         h ← h ∪ Proj_H(b)
     7:     for G ∈ {G : (G, H) ∈ C} do
     8:         s ∼ Val(X_In^L)
     9:         g ← g ∪ Scrub(s, {G})
    10:     if H ∈ Domain(δ) then
    11:         s ∼ {s ∈ Val(X_In^L) : Proj_δ(H)(Solve(H_s)) = Proj_δ(H)(Solve(H_b))}
    12:         g ← g ∪ Scrub(s, {G : (G, H) ∈ C})
    13:     h ← h ∪ Proj_H(Solve(L_g))
    14: return h
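The core move in the Scrub pseudocode is resampling: a base input is swapped for a random input that agrees with it on the aligned high-level variable, and a faithful hypothesis predicts the low-level output is unchanged. The toy sketch below (the parity model, input grid, and helper names are all hypothetical, and it covers only this resampling step, not the full recursive algorithm) makes that idea concrete.

```python
import random

# Hypothetical toy setting for the resampling step of causal scrubbing:
# the hypothesis is that the low-level model depends on its input only
# through the high-level variable "parity".

random.seed(0)

INPUTS = [(a, b) for a in range(4) for b in range(4)]

def high_parity(x):
    """Aligned high-level variable: parity of the input sum."""
    return (x[0] + x[1]) % 2

def low_model(x):
    """Low-level 'network'; here it genuinely depends only on parity."""
    return ((x[0] + x[1]) % 2) * 7

def scrub_input(base):
    """Resample an input agreeing with `base` on the high-level variable."""
    pool = [s for s in INPUTS if high_parity(s) == high_parity(base)]
    return random.choice(pool)

base = (3, 2)
scrubbed = scrub_input(base)
# Faithful hypothesis: scrubbing leaves the low-level output unchanged.
assert low_model(scrubbed) == low_model(base)
```

If the low-level model depended on more than parity, the assertion would fail for some resamples, which is exactly how scrubbing grades faithfulness.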
Open Source Code | Yes | We provide a companion Jupyter notebook that walks through this example.
Open Datasets | No | The paper uses illustrative examples such as the hierarchical equality task and the bubble sort algorithm to demonstrate its theoretical framework. It refers to benchmarks like CEBaB in an illustrative capacity, not as datasets for empirical experiments presented in the paper, so it does not provide access information for specific evaluation datasets.
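For concreteness, the hierarchical equality task mentioned above maps four inputs to whether the equality relation within the first pair matches that within the second pair. A minimal sketch (function name hypothetical):

```python
def hierarchical_equality(a, b, c, d):
    """True iff the first pair and the second pair agree on equality."""
    return (a == b) == (c == d)

print(hierarchical_equality(1, 1, 2, 2))  # True: both pairs equal
print(hierarchical_equality(1, 1, 2, 3))  # False: only the first pair is equal
print(hierarchical_equality(1, 2, 3, 4))  # True: both pairs unequal
```

The task's two intermediate equality judgments are what make it a natural testbed for high-level causal variables.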
Dataset Splits | No | The paper is theoretical, presenting a causal abstraction framework with illustrative examples. It does not conduct empirical experiments on datasets with train/validation/test splits, so no split information is provided.
Hardware Specification | No | The paper focuses on theoretical contributions and illustrative examples of causal abstraction. It does not describe experiments that would require hardware details such as GPU or CPU models.
Software Dependencies | No | The paper develops a conceptual framework. While it mentions a companion Jupyter notebook, it does not specify software libraries or solvers with version numbers needed to reproduce experiments.
Experiment Setup | No | The paper presents a theoretical foundation illustrated with examples. It does not conduct empirical experiments that would require hyperparameters, model initialization, training schedules, or other setup details.