Automatically Interpreting Millions of Features in Large Language Models

Authors: Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose

ICML 2025

Reproducibility: Variable | Result | LLM Response
Research Type | Experimental | In this work, we build an open-source automated pipeline to generate and evaluate natural language interpretations for SAE features using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of interpretations...
Researcher Affiliation | Collaboration | 1EleutherAI, 2Northwestern University, Evanston, Illinois, USA. Correspondence to: Gonçalo Paulo <EMAIL>.
Pseudocode | No | The paper describes the methodology in prose and through diagrams (Figure 1), but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available online. Together with this manuscript, we release the delphi library and accompanying scripts to guide the reproduction of these results.
Open Datasets | Yes | We collected feature activations from the SAEs over a 10M token sample of RedPajama-v2 (RPJv2; Computer, 2023). Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
Dataset Splits | Yes | Each feature is scored using 100 activating and 100 non-activating examples. The activating examples are chosen via stratified sampling such that there are always 10 examples from each of the 10 deciles of the activation distribution. For each feature, we sample a pool of 40 length-64 prompts from RPJv2 (Computer, 2023), of which 30 i.i.d. prompts are taken for scoring, while the remaining 10 are used by the explainer.
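The stratified sampling described above can be sketched as follows. This is a minimal illustration, not the delphi library's actual API: the function name, the `(example, activation)` pair layout, and the equal-width decile split over the ranked examples are all assumptions.

```python
import random

def stratified_decile_sample(activations, n_per_decile=10):
    """Sample activating examples evenly across the 10 deciles of the
    activation distribution: rank examples by activation value, split the
    ranking into 10 equal buckets, and draw n_per_decile from each.
    `activations` is a list of (example, activation_value) pairs."""
    ranked = sorted(activations, key=lambda pair: pair[1])
    n = len(ranked)
    sampled = []
    for d in range(10):
        lo, hi = d * n // 10, (d + 1) * n // 10
        bucket = ranked[lo:hi]
        sampled.extend(random.sample(bucket, min(n_per_decile, len(bucket))))
    return sampled
```

With the paper's settings (10 per decile), a feature with enough activating examples yields exactly 100 stratified samples spanning weak through strong activations.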
Hardware Specification | No | The paper thanks "CoreWeave for providing the compute resources" but does not specify any particular hardware (e.g., GPU models, CPU models, or memory) used for the experiments.
Software Dependencies | No | The paper mentions using specific LLMs such as Llama 3.1 70B Instruct and Claude 3.5 Sonnet, but does not provide versions for any libraries, frameworks, or programming languages used for implementation. It mentions releasing a delphi library, but without a version.
Experiment Setup | Yes | We show the explainer model, Llama 3.1 70b Instruct, 40 different activating examples, each activating example consisting of 32 tokens... For each feature, we sample a pool of 40 length-64 prompts from RPJv2... of which 30 i.i.d. prompts are taken for scoring, while the remaining 10 are used by the explainer. We then sample generations with a maximum of 8 new tokens from the subject model. We tune the intervention strength of each feature to various KL-divergence values on the scoring set with a binary search. The scorer is Llama-3.1 8B base. We use the base model for improved calibration...
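The binary search over intervention strength mentioned in the setup can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the function names, search bounds, and tolerance are hypothetical, and it assumes KL divergence increases monotonically with intervention strength.

```python
def tune_strength(kl_at, target_kl, lo=0.0, hi=100.0, tol=1e-3, max_iter=50):
    """Binary-search for an intervention strength whose KL divergence from
    the unmodified model's output distribution matches target_kl.
    `kl_at(s)` measures the KL at strength s on the scoring set and is
    assumed to be monotonically increasing in s."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        kl = kl_at(mid)
        if abs(kl - target_kl) < tol:
            return mid
        if kl < target_kl:
            lo = mid  # too weak: search the upper half
        else:
            hi = mid  # too strong: search the lower half
    return (lo + hi) / 2.0
```

Monotonicity is the key design assumption: stronger feature interventions push the output distribution further from the baseline, so a simple bisection converges to the strength matching each target KL value.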