Validating Mechanistic Interpretations: An Axiomatic Approach
Authors: Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the applicability of our axioms for validating mechanistic interpretations via two case studies: the existing, well-known analysis by Nanda et al. (2023a) of a model with a Transformer-based architecture (Vaswani et al., 2017) trained to solve modular addition, and a new analysis... Our implementation is available at https://github.com/nilspalumbo/axiomatic-validation. |
| Researcher Affiliation | Collaboration | 1University of Wisconsin-Madison 2Colorado State University 3Scale AI 4Carnegie Mellon University. |
| Pseudocode | Yes | Listing 1: Abstraction operators for the first block in Python pseudocode. |
| Open Source Code | Yes | Our implementation is available at https://github.com/nilspalumbo/axiomatic-validation. |
| Open Datasets | No | We construct a dataset of randomly generated 2-SAT formulas with ten clauses and up to five variables, eliminating syntactic duplicates. We use a solver (Z3 (De Moura & Bjørner, 2008)) to check satisfiability for each formula. We built a balanced dataset of 10^6 SAT and 10^6 UNSAT instances; we used 60% (also balanced) for training, and the rest for testing. For model analysis, we construct a separate dataset with 10^5 SAT and 10^5 UNSAT instances split as before into train and test sets. |
| Dataset Splits | Yes | We built a balanced dataset of 10^6 SAT and 10^6 UNSAT instances; we used 60% (also balanced) for training, and the rest for testing. For model analysis, we construct a separate dataset with 10^5 SAT and 10^5 UNSAT instances split as before into train and test sets. |
| Hardware Specification | Yes | All our experiments were run on an NVIDIA A100 GPU. |
| Software Dependencies | Yes | We use a solver (Z3 (De Moura & Bjørner, 2008)) to check satisfiability for each formula. |
| Experiment Setup | Yes | We use a two-layer ReLU decoder-only Transformer with token embeddings of dimension d = 128, learned positional embeddings, one attention head of dimension d in the first layer, four attention heads of dimension d/4 = 32 in the second, and n = 512 hidden units in the MLP (multi-layer perceptron). We use full-batch gradient descent with the AdamW optimizer, setting the learning rate γ = 0.001 and weight decay parameter λ = 1. We train extensively (1000 epochs). |
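The dataset-construction procedure quoted above (random 2-SAT formulas with ten clauses over up to five variables, syntactic deduplication, a balanced SAT/UNSAT split, and a 60/40 train/test partition) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the paper checks satisfiability with Z3, whereas the sketch below brute-forces all 2^5 assignments, which is equivalent at this scale; all function names and the literal encoding (variable index, polarity) are hypothetical choices for this example.

```python
import random
from itertools import product

def random_2sat(n_vars=5, n_clauses=10, rng=random):
    """Generate a random 2-SAT formula as a canonical tuple of clauses,
    each clause a pair of literals (variable index, polarity)."""
    clauses = []
    for _ in range(n_clauses):
        v1, v2 = rng.sample(range(n_vars), 2)          # two distinct variables
        lits = ((v1, rng.random() < 0.5), (v2, rng.random() < 0.5))
        clauses.append(tuple(sorted(lits)))
    return tuple(sorted(clauses))                      # canonical form for deduplication

def is_sat(formula, n_vars=5):
    """Brute-force satisfiability check over 2^n_vars assignments --
    adequate for <= 5 variables; the paper uses the Z3 solver instead."""
    return any(
        all(any(assign[v] == pol for v, pol in clause) for clause in formula)
        for assign in product([False, True], repeat=n_vars)
    )

def build_balanced(n_per_class, n_vars=5, n_clauses=10, train_frac=0.6, seed=0):
    """Sample a balanced SAT/UNSAT dataset, drop syntactic duplicates,
    and split each class 60/40 into train and test sets."""
    rng = random.Random(seed)
    sat, unsat, seen = [], [], set()
    while len(sat) < n_per_class or len(unsat) < n_per_class:
        f = random_2sat(n_vars, n_clauses, rng)
        if f in seen:
            continue                                   # eliminate syntactic duplicates
        seen.add(f)
        bucket = sat if is_sat(f, n_vars) else unsat
        if len(bucket) < n_per_class:
            bucket.append(f)
    cut = int(train_frac * n_per_class)                # balanced 60% train split
    train = sat[:cut] + unsat[:cut]
    test = sat[cut:] + unsat[cut:]
    return train, test
```

At ten clauses over five variables the clause-to-variable ratio sits above the 2-SAT satisfiability threshold, so UNSAT instances dominate and the rejection loop mainly waits for SAT samples; this is harmless at sketch scale but worth noting before generating 10^6 instances per class.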