Automatically Interpreting Millions of Features in Large Language Models

Authors: Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose

ICML 2025

Reproducibility: Variable | Result | LLM Response
Research Type | Experimental | In this work, we build an open-source automated pipeline to generate and evaluate natural language interpretations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of interpretations...
Researcher Affiliation | Collaboration | 1EleutherAI, 2Northwestern University, Evanston, Illinois, USA. Correspondence to: Gonçalo Paulo <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose and through diagrams (Figure 1), but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available online. Together with this manuscript, we release the delphi library and accompanying scripts to guide the reproduction of these results.
Open Datasets | Yes | We collected feature activations from the SAEs over a 10M token sample of RedPajama-v2 (RPJv2; Computer, 2023). Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
Dataset Splits | Yes | Each feature is scored using 100 activating and 100 non-activating examples. The activating examples are chosen via stratified sampling such that there are always 10 examples from each of the 10 deciles of the activation distribution. For each feature, we sample a pool of 40 length-64 prompts from RPJv2 (Computer, 2023), of which 30 i.i.d. prompts are taken for scoring, while the remaining 10 are used by the explainer.
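The stratified sampling described above can be sketched as follows. This is a minimal illustration, not the delphi library's actual API: the function name, the `(example, activation)` pair layout, and the equal-width decile split over the ranked examples are all assumptions.

```python
import random

def stratified_decile_sample(activations, n_per_decile=10):
    """Sample activating examples evenly across the 10 deciles of the
    activation distribution: rank examples by activation value, split the
    ranking into 10 equal buckets, and draw n_per_decile from each.
    `activations` is a list of (example, activation_value) pairs."""
    ranked = sorted(activations, key=lambda pair: pair[1])
    n = len(ranked)
    sampled = []
    for d in range(10):
        lo, hi = d * n // 10, (d + 1) * n // 10
        bucket = ranked[lo:hi]
        sampled.extend(random.sample(bucket, min(n_per_decile, len(bucket))))
    return sampled
```

With the paper's settings (10 per decile), a feature with enough activating examples yields exactly 100 stratified samples spanning weak through strong activations.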
Hardware Specification | No | The paper thanks "CoreWeave for providing the compute resources" but does not specify any particular hardware (e.g., GPU models, CPU models, or memory) used for the experiments.
Software Dependencies | No | The paper mentions using specific LLMs such as Llama 3.1 70B Instruct and Claude 3.5 Sonnet, but does not provide versions for any libraries, frameworks, or programming languages used for implementation. It mentions releasing a delphi library, but without a version.
Experiment Setup | Yes | We show the explainer model, Llama 3.1 70b Instruct, 40 different activating examples, each activating example consisting of 32 tokens... For each feature, we sample a pool of 40 length-64 prompts from RPJv2... of which 30 i.i.d. prompts are taken for scoring, while the remaining 10 are used by the explainer. We then sample generations with a maximum of 8 new tokens from the subject model. We tune the intervention strength of each feature to various KL-divergence values on the scoring set with a binary search. The scorer is Llama-3.1 8B base. We use the base model for improved calibration...
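The binary search over intervention strength mentioned in the setup can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the function names, search bounds, and tolerance are hypothetical, and it assumes KL divergence increases monotonically with intervention strength.

```python
def tune_strength(kl_at, target_kl, lo=0.0, hi=100.0, tol=1e-3, max_iter=50):
    """Binary-search for an intervention strength whose KL divergence from
    the unmodified model's output distribution matches target_kl.
    `kl_at(s)` measures the KL at strength s on the scoring set and is
    assumed to be monotonically increasing in s."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        kl = kl_at(mid)
        if abs(kl - target_kl) < tol:
            return mid
        if kl < target_kl:
            lo = mid  # too weak: search the upper half
        else:
            hi = mid  # too strong: search the lower half
    return (lo + hi) / 2.0
```

Monotonicity is the key design assumption: stronger feature interventions push the output distribution further from the baseline, so a simple bisection converges to the strength matching each target KL value.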