Automatically Interpreting Millions of Features in Large Language Models
Authors: Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we build an open-source automated pipeline to generate and evaluate natural language interpretations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of interpretations... |
| Researcher Affiliation | Collaboration | 1EleutherAI 2Northwestern University, Evanston, Illinois, USA. Correspondence to: Gonçalo Paulo <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and through diagrams (Figure 1), but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available online. Together with this manuscript, we release the delphi library and accompanying scripts to guide the reproduction of these results. |
| Open Datasets | Yes | We collected feature activations from the SAEs over a 10M token sample of Red Pajama-v2 (RPJv2; Computer 2023). Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data. |
| Dataset Splits | Yes | Each feature is scored using 100 activating and 100 non-activating examples. The activating examples are chosen via stratified sampling such that there are always 10 examples from each of the 10 deciles of the activation distribution. For each feature, we sample a pool of 40 length-64 prompts from RPJv2 (Computer, 2023), of which 30 i.i.d. prompts are taken for scoring, while the remaining 10 are used by the explainer. |
| Hardware Specification | No | The paper thanks "CoreWeave for providing the compute resources" but does not specify any particular hardware (e.g., GPU models, CPU models, or memory) used for the experiments. |
| Software Dependencies | No | The paper mentions using specific LLMs like Llama 3.1 70b Instruct and Claude Sonnet 3.5, but does not provide specific versions for any libraries, frameworks, or programming languages used for implementation. It mentions releasing a 'delphi library' but without a version. |
| Experiment Setup | Yes | We show the explainer model, Llama 3.1 70b Instruct, 40 different activating examples, each activating example consisting of 32 tokens... For each feature, we sample a pool of 40 length-64 prompts from RPJv2... of which 30 i.i.d. prompts are taken for scoring, while the remaining 10 are used by the explainer. We then sample generations with a maximum of 8 new tokens from the subject model. We tune the intervention strength of each feature to various KL-divergence values on the scoring set with a binary search. The scorer is Llama-3.1 8B base. We use the base model for improved calibration... |
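The stratified sampling procedure described in the Dataset Splits row (10 activating examples drawn from each of the 10 deciles of a feature's activation distribution) can be sketched as follows. This is an illustrative reconstruction, not the released delphi code; the example data and the function name `stratified_decile_sample` are assumptions.

```python
import random

def stratified_decile_sample(examples, per_decile=10, seed=0):
    """Draw `per_decile` examples from each of the 10 deciles of the
    activation distribution, yielding 10 * per_decile examples in total.

    `examples` is a list of (text, activation) pairs. The list is ranked
    by activation strength and split into 10 equal-width rank deciles;
    examples are then sampled uniformly without replacement within each.
    """
    rng = random.Random(seed)
    ranked = sorted(examples, key=lambda e: e[1])  # ascending activation
    n = len(ranked)
    sample = []
    for d in range(10):
        lo, hi = d * n // 10, (d + 1) * n // 10  # rank-decile boundaries
        sample.extend(rng.sample(ranked[lo:hi], per_decile))
    return sample

# Example: 1000 activating contexts reduce to the 100 used for scoring.
pool = [(f"ctx{i}", float(i)) for i in range(1000)]
scoring_examples = stratified_decile_sample(pool)
print(len(scoring_examples))  # 100
```

Rank-based deciles (rather than value-based bins) guarantee every stratum is non-empty even when activations are heavily skewed, which matches the paper's claim that there are "always 10 examples from each of the 10 deciles".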
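The Experiment Setup row mentions tuning each feature's intervention strength to a target KL divergence "with a binary search". A minimal sketch of that tuning loop, assuming the caller supplies a (hypothetical) `kl_at(strength)` function that measures KL divergence on the scoring set and is monotonically increasing in strength:

```python
def tune_strength(kl_at, target_kl, lo=0.0, hi=100.0, tol=1e-3, max_iter=50):
    """Binary-search for an intervention strength whose induced KL
    divergence from the unmodified model matches `target_kl`.

    `kl_at` is a caller-supplied callable (assumed monotone increasing);
    `lo`/`hi` bracket the search and `tol` is the acceptable KL error.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        kl = kl_at(mid)
        if abs(kl - target_kl) < tol:
            return mid
        if kl < target_kl:
            lo = mid  # too weak: search the upper half
        else:
            hi = mid  # too strong: search the lower half
    return (lo + hi) / 2
```

With, say, a toy monotone response `kl_at = lambda s: s ** 2` and `target_kl = 4.0`, the search converges to a strength near 2.0; in the paper's pipeline the same loop would be run once per feature and per target KL value.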