AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Therefore, we introduce AXBENCH, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best.
Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University; 2Pr(AI)2R Group. Correspondence to: Zhengxuan Wu <EMAIL>, Aryaman Arora <EMAIL>.
Pseudocode | No | The paper describes methods such as DiffMean, PCA, LAT, Linear Probe, SSV, ReFT-r1, and SAE using mathematical formulas and textual explanations (e.g., Equations 1-15), but it does not present them in structured pseudocode or algorithm blocks with explicit control flow.
Open Source Code | Yes | Code is available at github.com/stanfordnlp/axbench. "We open-source all of our datasets and trained dictionaries at https://huggingface.co/pyvene."
Open Datasets | Yes | We synthetically generate training and validation datasets (see 3.1) for 500 concepts, which we release as CONCEPT500. [...] We additionally release training and evaluation datasets for all 16K concepts in Gemma Scope as the CONCEPT16K dataset suite.
Dataset Splits | Yes | We construct a small training dataset $\mathcal{D}_{\text{train}} = \{(x^+_{c,i}, y^+)\}_{i=1}^{n/2} \cup \{(x^-_{c,i}, y^-)\}_{i=1}^{n/2}$ with n examples, and a concept detection evaluation dataset $\mathcal{D}_{\text{concept}}$ of the same structure with harder examples, where $y^+$ and $y^-$ are binary labels indicating whether the concept c is present. We set n = 144 for our main experiments. [...] For each concept, we include 144 examples for training and 72 samples for evaluating concept detection.
Hardware Specification | No | The paper mentions evaluating methods on "Gemma-2-2B and 9B" models, but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud configurations) used to run the experiments.
Software Dependencies | No | The paper mentions using "pyvene" and "PyTorch" as well as "sklearn.decomposition.PCA" and "AdamW", but it does not specify version numbers for these software components.
Experiment Setup | Yes | To ensure a fair comparison, we perform separate hyperparameter tuning for each method that requires training. For each method, we conduct separate hyperparameter tuning on a small CONCEPT10 dataset containing training and testing datasets for only 10 concepts. [...] Table 8 and Table 9 show hyperparameter settings for methods that require training. [...] We minimize the loss with AdamW with a linear scheduler for all methods that require training.
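The difference-in-means (DiffMean) method that the responses above single out as strongest for concept detection can be sketched in a few lines. The toy example below is not the authors' implementation: it substitutes synthetic Gaussian vectors for real Gemma-2 hidden states, and the dimensionality and concept offset are illustrative assumptions. Only the balanced n = 144 training split and the 72 held-out detection examples mirror the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # toy hidden-state dimensionality (assumption, not from the paper)
n = 144  # examples per concept, split n/2 positive / n/2 negative, as in the paper

# Ground-truth concept direction used only to generate the toy activations.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

def toy_hidden_states(k: int, has_concept: bool) -> np.ndarray:
    """Stand-in for model activations: Gaussian noise, shifted along
    true_dir when the concept is present."""
    states = rng.normal(size=(k, d))
    if has_concept:
        states += 4.0 * true_dir  # illustrative offset strength
    return states

# Training split: n/2 concept-positive and n/2 concept-negative examples.
pos_train = toy_hidden_states(n // 2, True)
neg_train = toy_hidden_states(n // 2, False)

# DiffMean direction: difference of the class-conditional activation means.
w = pos_train.mean(axis=0) - neg_train.mean(axis=0)

# Detect the concept by projecting activations onto w, thresholding at the
# midpoint of the two training-class mean projections.
threshold = ((pos_train @ w).mean() + (neg_train @ w).mean()) / 2.0

# Evaluation split mirrors the paper's 72 held-out detection examples.
eval_x = np.vstack([toy_hidden_states(36, True), toy_hidden_states(36, False)])
eval_y = np.array([True] * 36 + [False] * 36)
accuracy = ((eval_x @ w) > threshold) == eval_y
accuracy = accuracy.mean()
```

On this well-separated toy data the projection threshold recovers the concept labels with high accuracy; with real model activations the same direction w can also serve as a steering vector added to the residual stream.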