AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, Christopher Potts
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Therefore, we introduce AXBENCH, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University 2Pr(AI)2R Group. Correspondence to: Zhengxuan Wu <EMAIL>, Aryaman Arora <EMAIL>. |
| Pseudocode | No | The paper describes methods like Diff Mean, PCA, LAT, Linear Probe, SSV, ReFT-r1, SAE using mathematical formulas and textual explanations (e.g., equations 1-15), but it does not present them in structured pseudocode or algorithm blocks with explicit control flow statements. |
| Open Source Code | Yes | github.com/stanfordnlp/axbench. 1We open-source all of our datasets and trained dictionaries at https://huggingface.co/pyvene. |
| Open Datasets | Yes | We synthetically generate training and validation datasets (see 3.1) for 500 concepts, which we release as CONCEPT500. [...] We additionally release training and evaluation datasets for all 16K concepts in Gemma Scope as the CONCEPT16K dataset suite. |
| Dataset Splits | Yes | We construct a small training dataset $D_{\text{train}} = \{(x^+_{c,i}, y^+)\}_{i=1}^{n/2} \cup \{(x^-_{c,i}, y^-)\}_{i=1}^{n/2}$ with n examples and a concept detection evaluation dataset $D_{\text{concept}}$ of the same structure and harder examples, where $y^+$ and $y^-$ are binary labels indicating whether the concept c is present. We set n = 144 for our main experiments. [...] For each concept, we include 144 examples for training and 72 samples for evaluating concept detection. |
| Hardware Specification | No | The paper mentions evaluating methods on "Gemma-2-2B and 9B" models, but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud configurations) used to run the experiments. |
| Software Dependencies | No | The paper mentions using "pyvene" and "PyTorch" as well as "sklearn.decomposition.PCA" and "AdamW", but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | To ensure a fair comparison, we perform separate hyperparameter-tuning for each method that requires training. For each method, we conduct separate hyperparameter-tuning on a small CONCEPT10 dataset containing training and testing datasets only for 10 concepts. [...] Table 8 and Table 9 show hyperparameter settings for methods that require training. [...] We minimise the loss with AdamW with a linear scheduler for all methods that require training. |
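The Dataset Splits row describes, per concept, n = 144 training examples split evenly between positive and negative labels, plus 72 harder concept-detection examples of the same structure. A minimal pure-Python sketch of that per-concept layout (the example texts and the function name `make_concept_split` are placeholders for illustration, not drawn from CONCEPT500):

```python
def make_concept_split(concept, n_train=144, n_eval=72):
    """Build the per-concept split described in the paper:
    n_train labeled examples (half positive, half negative) plus
    n_eval harder concept-detection examples of the same structure."""
    half = n_train // 2
    train = (
        [(f"text containing {concept} #{i}", 1) for i in range(half)]
        + [(f"unrelated text #{i}", 0) for i in range(half)]
    )
    eval_concept = (
        [(f"hard positive for {concept} #{i}", 1) for i in range(n_eval // 2)]
        + [(f"hard negative #{i}", 0) for i in range(n_eval // 2)]
    )
    return train, eval_concept

train, eval_concept = make_concept_split("gratitude")
```

This keeps the 72/72 positive/negative balance in training and the 36/36 balance in the 72-example detection set implied by "the same structure".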
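The Experiment Setup row states that all trained methods minimise their loss with AdamW under a linear scheduler. A hedged PyTorch sketch of that optimization loop; the probe module, hidden size, learning rate, batch, and step count here are hypothetical stand-ins, since the report quotes only the optimizer and scheduler choice:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

# Hypothetical per-concept linear probe; the paper's trained modules
# (e.g., ReFT-r1 dictionaries) are not specified in this row.
probe = nn.Linear(2304, 1)  # 2304 = Gemma-2-2B hidden size
optimizer = AdamW(probe.parameters(), lr=1e-3)

# Linear decay from the initial LR toward zero over all steps.
num_steps = 20
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0,
                     total_iters=num_steps)

loss_fn = nn.BCEWithLogitsLoss()
for step in range(num_steps):
    x = torch.randn(16, 2304)                    # stand-in activations
    y = torch.randint(0, 2, (16, 1)).float()     # stand-in binary labels
    optimizer.zero_grad()
    loss = loss_fn(probe(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The random tensors stand in for extracted model activations and concept labels; only the AdamW-plus-linear-scheduler pairing is taken from the quoted setup.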