AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Therefore, we introduce AXBENCH, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best.
Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University; 2Pr(AI)2R Group. Correspondence to: Zhengxuan Wu <EMAIL>, Aryaman Arora <EMAIL>.
Pseudocode | No | The paper describes methods such as DiffMean, PCA, LAT, Linear Probe, SSV, ReFT-r1, and SAE using mathematical formulas and textual explanations (e.g., Equations 1-15), but it does not present them in structured pseudocode or algorithm blocks with explicit control flow.
Open Source Code | Yes | Code is available at github.com/stanfordnlp/axbench. "We open-source all of our datasets and trained dictionaries at https://huggingface.co/pyvene."
Open Datasets | Yes | We synthetically generate training and validation datasets (see 3.1) for 500 concepts, which we release as CONCEPT500. [...] We additionally release training and evaluation datasets for all 16K concepts in Gemma Scope as the CONCEPT16K dataset suite.
Dataset Splits | Yes | We construct a small training dataset $\mathcal{D}_{\text{train}} = \{(x^+_{c,i}, y^+)\}_{i=1}^{n/2} \cup \{(x^-_{c,i}, y^-)\}_{i=1}^{n/2}$ with n examples, and a concept detection evaluation dataset $\mathcal{D}_{\text{concept}}$ of the same structure with harder examples, where $y^+$ and $y^-$ are binary labels indicating whether the concept c is present. We set n = 144 for our main experiments. [...] For each concept, we include 144 examples for training and 72 samples for evaluating concept detection.
Hardware Specification | No | The paper mentions evaluating methods on "Gemma-2-2B and 9B" models, but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud configurations) used to run the experiments.
Software Dependencies | No | The paper mentions using "pyvene" and "PyTorch" as well as "sklearn.decomposition.PCA" and "AdamW", but it does not specify version numbers for these software components.
Experiment Setup | Yes | To ensure a fair comparison, we perform separate hyperparameter tuning for each method that requires training. For each method, we conduct separate hyperparameter tuning on a small CONCEPT10 dataset containing training and testing datasets for only 10 concepts. [...] Table 8 and Table 9 show hyperparameter settings for methods that require training. [...] We minimize the loss with AdamW with a linear scheduler for all methods that require training.
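The difference-in-means (DiffMean) method that the responses above single out as strongest for concept detection can be sketched in a few lines. The toy example below is not the authors' implementation: it substitutes synthetic Gaussian vectors for real Gemma-2 hidden states, and the dimensionality and concept offset are illustrative assumptions. Only the balanced n = 144 training split and the 72 held-out detection examples mirror the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # toy hidden-state dimensionality (assumption, not from the paper)
n = 144  # examples per concept, split n/2 positive / n/2 negative, as in the paper

# Ground-truth concept direction used only to generate the toy activations.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

def toy_hidden_states(k: int, has_concept: bool) -> np.ndarray:
    """Stand-in for model activations: Gaussian noise, shifted along
    true_dir when the concept is present."""
    states = rng.normal(size=(k, d))
    if has_concept:
        states += 4.0 * true_dir  # illustrative offset strength
    return states

# Training split: n/2 concept-positive and n/2 concept-negative examples.
pos_train = toy_hidden_states(n // 2, True)
neg_train = toy_hidden_states(n // 2, False)

# DiffMean direction: difference of the class-conditional activation means.
w = pos_train.mean(axis=0) - neg_train.mean(axis=0)

# Detect the concept by projecting activations onto w, thresholding at the
# midpoint of the two training-class mean projections.
threshold = ((pos_train @ w).mean() + (neg_train @ w).mean()) / 2.0

# Evaluation split mirrors the paper's 72 held-out detection examples.
eval_x = np.vstack([toy_hidden_states(36, True), toy_hidden_states(36, False)])
eval_y = np.array([True] * 36 + [False] * 36)
accuracy = ((eval_x @ w) > threshold) == eval_y
accuracy = accuracy.mean()
```

On this well-separated toy data the projection threshold recovers the concept labels with high accuracy; with real model activations the same direction w can also serve as a steering vector added to the residual stream.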