Validating Mechanistic Interpretations: An Axiomatic Approach

Authors: Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the applicability of our axioms for validating mechanistic interpretations via two case studies: the existing, well-known analysis by Nanda et al. (2023a) of a model with a Transformer-based architecture (Vaswani et al., 2017) trained to solve modular addition, and a new analysis... Our implementation is available at https://github.com/nilspalumbo/axiomatic-validation.
Researcher Affiliation Collaboration ¹University of Wisconsin-Madison, ²Colorado State University, ³Scale AI, ⁴Carnegie Mellon University.
Pseudocode Yes Listing 1: Abstraction operators for the first block in Python pseudocode.
Open Source Code Yes Our implementation is available at https://github.com/nilspalumbo/axiomatic-validation.
Open Datasets No We construct a dataset of randomly generated 2-SAT formulas with ten clauses and up to five variables, eliminating syntactic duplicates. We use a solver (Z3 (De Moura & Bjørner, 2008)) to check satisfiability for each formula. We built a balanced dataset of 10^6 SAT and 10^6 UNSAT instances; we used 60% (also balanced) for training, and the rest for testing. For model analysis, we construct a separate dataset with 10^5 SAT and 10^5 UNSAT instances split as before into train and test sets.
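The dataset-construction procedure quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: function names are hypothetical, and the satisfiability check brute-forces all 2^5 assignments instead of calling Z3 as the paper does.

```python
import itertools
import random

random.seed(0)
NUM_VARS = 5      # up to five variables, as in the paper
NUM_CLAUSES = 10  # ten clauses per formula

def random_2sat_formula():
    """A formula is a sorted list of clauses; each clause is a pair of
    literals, where literal k > 0 means variable k and k < 0 its negation."""
    clauses = set()
    while len(clauses) < NUM_CLAUSES:
        a, b = random.sample(range(1, NUM_VARS + 1), 2)  # two distinct variables
        clause = tuple(sorted((a * random.choice((-1, 1)),
                               b * random.choice((-1, 1)))))
        clauses.add(clause)  # set membership eliminates syntactic duplicates
    return sorted(clauses)

def is_satisfiable(formula):
    """Brute-force check over all assignments; the paper uses Z3 instead."""
    for bits in itertools.product([False, True], repeat=NUM_VARS):
        def lit(k):
            return bits[abs(k) - 1] if k > 0 else not bits[abs(k) - 1]
        if all(lit(a) or lit(b) for a, b in formula):
            return True
    return False

# Label a small sample; the paper balances SAT/UNSAT counts before splitting.
sample = [random_2sat_formula() for _ in range(100)]
labels = [is_satisfiable(f) for f in sample]
```

For formulas of this size the exhaustive check is instant; Z3 becomes necessary only for larger instances, but using it here keeps the labeling procedure identical to the paper's.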
Dataset Splits Yes We built a balanced dataset of 10^6 SAT and 10^6 UNSAT instances; we used 60% (also balanced) for training, and the rest for testing. For model analysis, we construct a separate dataset with 10^5 SAT and 10^5 UNSAT instances split as before into train and test sets.
Hardware Specification Yes All our experiments were run on an NVIDIA A100 GPU.
Software Dependencies Yes We use a solver (Z3 (De Moura & Bjørner, 2008)) to check satisfiability for each formula.
Experiment Setup Yes We use a two-layer ReLU decoder-only Transformer with token embeddings of dimension d = 128, learned positional embeddings, one attention head of dimension d in the first layer, four attention heads of dimension d/4 = 32 in the second, and n = 512 hidden units in the MLP (multi-layer perceptron). We use full-batch gradient descent with the AdamW optimizer, setting the learning rate γ = 0.001 and weight decay parameter λ = 1. We train for 1000 epochs.
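The reported hyperparameters can be collected into a single configuration sketch, which also makes the dimension bookkeeping easy to check. The dict layout and key names below are illustrative assumptions, not taken from the authors' repository; only the values come from the quoted setup.

```python
# Configuration sketch of the reported setup (key names are hypothetical).
config = {
    "layers": 2,
    "d_model": 128,        # token embedding dimension d
    "heads_layer1": 1,     # one attention head of dimension d
    "heads_layer2": 4,     # four attention heads of dimension d/4
    "mlp_hidden": 512,     # n hidden units in the MLP
    "activation": "ReLU",
    "optimizer": "AdamW",
    "learning_rate": 1e-3, # gamma
    "weight_decay": 1.0,   # lambda
    "epochs": 1000,
    "batching": "full-batch gradient descent",
}

# Sanity check: the four second-layer heads partition the model dimension,
# giving the per-head dimension d/4 = 32 stated in the setup.
head_dim = config["d_model"] // config["heads_layer2"]
```

A check like this catches the most common transcription error in Transformer configs: a head count that does not evenly divide the model dimension.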