Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models

Authors: Can Demircan, Tankred Saanum, Akshay Jagadish, Marcel Binz, Eric Schulz

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Through three different tasks, we first show that Llama 3 70B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors. Notably, these representations emerge despite the model only being trained to predict the next token. We verify that these representations are indeed causally involved in the computation of TD errors and Q-values by performing carefully designed interventions on them.
Researcher Affiliation | Academia | Institute for Human-Centered AI, Helmholtz Computational Health Center, Munich, Germany; Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Pseudocode | No | The paper describes algorithms and methods using mathematical equations and textual descriptions, but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement or a direct link indicating that the source code for the methodology described in this paper is publicly available.
Open Datasets | Yes | The node names are sampled from the category labels in the THINGS database (Hebart et al., 2019).
Dataset Splits | Yes | Llama completed 100 independent experiments initialized with unique seeds, each consisting of 30 episodes. We sampled actions from a random policy in the first 7 episodes to ease the exploration problem.
Hardware Specification | No | The paper mentions using "Llama 3 70B" but does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | We used the Adam optimizer (Kingma & Ba, 2017) with the default parameters... All the interventions were performed using the nnsight (Fiotto-Kaufman et al., 2024) library... metric MDS as implemented in scikit-learn (Pedregosa et al., 2011). The paper names specific libraries such as nnsight and scikit-learn, but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | For all SAEs, a batch size of 256, a learning rate of 1e-04, and β = 1e-05 were used. We used the Adam optimizer (Kingma & Ba, 2017) with the default parameters and shuffled the training data... We trained each SAE using a regularization strength β = 1e-05 for 30 epochs on 18000 residual stream representations... Other hyperparameters used to train the Q-learning model include the discount parameter γ = 0.99 across all tasks. The learning rate α was 0.1 in the Two-Step Task and the Grid World, and 0.05 in the Graph Learning Task.
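For reference, the TD errors and Q-values the paper identifies in Llama's residual stream come from standard tabular Q-learning. The sketch below uses the reported hyperparameters (γ = 0.99, α = 0.1 for the Two-Step Task and Grid World); the toy state/action sizes and the environment step are illustrative assumptions, not the paper's tasks.

```python
import numpy as np

# Hyperparameters as reported above; the 5-state, 2-action toy problem is hypothetical.
GAMMA = 0.99
ALPHA = 0.1

def td_update(Q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """One temporal-difference (TD) update on a tabular Q-function.

    The TD error is delta = r + gamma * max_a' Q(s', a') - Q(s, a);
    these deltas and Q-values are the quantities probed in the paper.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Toy usage: a single rewarded transition from state 0 to state 2.
Q = np.zeros((5, 2))
delta = td_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With an all-zero table and reward 1.0, the TD error is 1.0 and the updated entry becomes α × 1.0 = 0.1.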
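The SAE objective in the setup above can be sketched as reconstruction error plus an L1 sparsity penalty with the reported strength β = 1e-05. This is a minimal NumPy forward pass and loss only; the ReLU encoder/decoder form, the weight shapes, and the toy dimensions are assumptions, and the actual training loop (Adam, 30 epochs on 18000 residual-stream representations) is omitted.

```python
import numpy as np

BETA = 1e-05  # L1 regularization strength, as reported

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x into sparse codes z, then reconstruct x."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU yields nonnegative, sparse codes
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def sae_loss(x, x_hat, z, beta=BETA):
    """Mean squared reconstruction error plus L1 penalty on the codes."""
    return np.mean((x - x_hat) ** 2) + beta * np.sum(np.abs(z))

# Toy usage: d_model = 8 residual dimensions, d_hidden = 32 SAE features,
# one batch of 256 activations (the reported batch size).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_enc, b_dec = np.zeros(d_hidden), np.zeros(d_model)

x = rng.normal(size=(256, d_model))
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
```

In practice the hidden dimension is much larger than the model dimension, so the L1 term is what forces most code entries to zero.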