Scaling and evaluating sparse autoencoders
Authors: Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically study the scaling laws with respect to sparsity, autoencoder size, and language model size. To demonstrate that our methodology can scale reliably, we train a 16 million latent autoencoder on GPT-4 (OpenAI, 2023) residual stream activations. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. |
| Researcher Affiliation | Industry | Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu — OpenAI |
| Pseudocode | No | The paper provides mathematical equations and describes methods in text, such as: "z = ReLU(W_enc(x − b_pre) + b_enc), x̂ = W_dec z + b_pre (Eq. 1)", "z = TopK(W_enc(x − b_pre)) (Eq. 2)", and "L(n, k) = exp(α + β_k log(k) + β_n log(n) + γ log(k) log(n)) + exp(ζ + η log(k)) (Eq. 3)". However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like steps for a procedure. |
| Open Source Code | Yes | To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release code and autoencoders for open-source models, as well as a visualizer. |
| Open Datasets | Yes | Inputs: We train autoencoders on the residual streams of both GPT-2 small (Radford et al., 2019) and models from a series of models of increasing size, sharing GPT-4 architecture and training setup, including GPT-4 itself (OpenAI, 2023). ... Table 1: Tasks used in the probe-based evaluation suite ... amazon (McAuley & Leskovec, 2013), sciq (Welbl et al., 2017), truthfulqa (Lin et al., 2021), piqa (Bisk et al., 2020), ag_news (Gulli), europarl_es (Koehn, 2005), jigsaw (Cjadams et al., 2017). |
| Dataset Splits | No | The paper mentions using a "context length of 64 tokens for all experiments" and for N2G explanations, it uses "a random sample of up to 16 nonzero activations to build the graph, and another 16 as true positives for computing recall." However, it does not provide explicit training, validation, or test dataset splits (e.g., percentages or absolute counts) for the main autoencoder training or for the general evaluation tasks. While it uses known datasets, it does not specify how these datasets are partitioned for the experiments conducted in this paper. |
| Hardware Specification | No | The paper mentions "Model parallelism is necessary once parameters cannot fit on one GPU." and "For the largest (16 million) latent autoencoder, we use 512-way sharding." It refers to GPUs and parallelism but does not specify any exact GPU models (e.g., NVIDIA A100), CPU models, memory details, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the "Adam optimizer (Kingma & Ba, 2014)" and "torch default initialization" (implying PyTorch) and "NCCL" for parallelism. However, it does not provide specific version numbers for any of these software components, or other key libraries like Python or CUDA. |
| Experiment Setup | Yes | Hyperparameters: To simplify analysis, we do not consider learning rate warmup or decay unless otherwise noted. We sweep learning rates at small scales and extrapolate the trend of optimal learning rates for large scale. See Appendix A for other optimization details. Appendix A.1 Initialization: We initialize the bias b_pre to be the geometric median... We initialize the encoder directions parallel to the respective decoder directions... We scale decoder latent directions to be unit norm... Appendix A.2 Auxiliary Loss: We define an auxiliary loss (AuxK) similar to ghost grads... (typically k_aux = 512)... α is a small coefficient (typically 1/32). Appendix A.3 Optimizer: We use the Adam optimizer (Kingma & Ba, 2014) with β₁ = 0.9 and β₂ = 0.999, and a constant learning rate. ... We use ε = 6.25 × 10⁻¹⁰... Appendix A.4 Batch Size: ...we use a batch size of 131,072 tokens for most of our experiments. Appendix A.5 Weight Averaging: We find that keeping an exponential moving average (EMA)... We use an EMA coefficient of 0.999... |
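The TopK autoencoder quoted in the Pseudocode row (Eqs. 1–2) can be sketched in NumPy. This is a minimal sketch, not the authors' released code: the function name, shapes, and the use of `argsort` to select the top-k latents are assumptions. Following Eq. 2, the encoder bias is omitted for the TopK variant.

```python
import numpy as np

def topk_sae_forward(x, W_enc, W_dec, b_pre, k):
    """Sketch of a TopK sparse-autoencoder forward pass (Eqs. 1-2).

    Assumed shapes (not specified in this report):
      x: (d,), W_enc: (n, d), W_dec: (d, n), b_pre: (d,)
    """
    pre = W_enc @ (x - b_pre)          # Eq. 2: encoder pre-activations, no encoder bias
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]         # indices of the k largest pre-activations
    z[top] = pre[top]                  # TopK activation: keep only those k latents
    x_hat = W_dec @ z + b_pre          # linear decoder plus pre-bias, as in Eq. 1
    return z, x_hat
```

The TopK activation enforces exactly k nonzero latents per token, which is how the paper controls sparsity directly rather than via an L1 penalty.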
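The joint scaling law in Eq. 3 can be evaluated directly once its coefficients are known. The paper fits these coefficients to training curves; they are not reproduced in this report, so any concrete values passed to this sketch are placeholders.

```python
import math

def scaling_law_loss(n, k, alpha, beta_k, beta_n, gamma, zeta, eta):
    """Evaluate L(n, k) from Eq. 3: a joint power law in latent count n and
    sparsity k, plus an irreducible term that depends only on k.
    The coefficients (alpha ... eta) must be fit to observed losses."""
    joint = math.exp(alpha + beta_k * math.log(k) + beta_n * math.log(n)
                     + gamma * math.log(k) * math.log(n))
    floor = math.exp(zeta + eta * math.log(k))
    return joint + floor
```

With beta_n < 0, the joint term decays as a power law in the number of latents n, while the second term sets a k-dependent loss floor that no amount of scaling n can cross.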
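The weight averaging quoted in the Experiment Setup row (Appendix A.5) is described only as an EMA with coefficient 0.999; the update below is the standard EMA form and is an assumption about the paper's exact implementation.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step over a list of scalar parameters:
    ema <- decay * ema + (1 - decay) * current.
    The 0.999 default matches the coefficient quoted from Appendix A.5."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

In practice the EMA copy of the weights is updated after each optimizer step and used for evaluation, smoothing out noise from the large-batch training runs.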