Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models

Authors: Can Demircan, Tankred Saanum, Akshay Jagadish, Marcel Binz, Eric Schulz

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Through three different tasks, we first show that Llama 3 70B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors. Notably, these representations emerge despite the model only being trained to predict the next token. We verify that these representations are indeed causally involved in the computation of TD errors and Q-values by performing carefully designed interventions on them.
Researcher Affiliation | Academia | Institute for Human-Centered AI, Helmholtz Computational Health Center, Munich, Germany; Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Pseudocode | No | The paper describes algorithms and methods using mathematical equations and textual descriptions, but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not contain an explicit statement or a direct link indicating that the source code for the methodology described in this paper is publicly available.
Open Datasets | Yes | The node names are sampled from the category labels in the THINGS database (Hebart et al., 2019).
Dataset Splits | Yes | Llama completed 100 independent experiments initialized with unique seeds, each consisting of 30 episodes. We sampled actions from a random policy in the first 7 episodes to ease the exploration problem.
Hardware Specification | No | The paper mentions using "Llama 3 70B" but does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | We used the Adam optimizer (Kingma & Ba, 2017) with the default parameters... All the interventions were performed using the nnsight (Fiotto-Kaufman et al., 2024) library... metric MDS as implemented in scikit-learn (Pedregosa et al., 2011). The paper names specific libraries such as nnsight and scikit-learn, but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | For all SAEs, a batch size of 256, a learning rate of 1e-04, and β = 1e-05 were used. We used the Adam optimizer (Kingma & Ba, 2017) with the default parameters and shuffled the training data... We trained each SAE using a regularization strength β = 1e-05 for 30 epochs on 18000 residual stream representations... Other hyperparameters used to train the Q-learning model include the discount parameter γ = 0.99 across all tasks. The learning rate α was 0.1 in the Two-Step Task and the Grid World, and 0.05 in the Graph Learning Task.
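For reference, the TD errors and Q-values the paper identifies in Llama's residual stream come from standard tabular Q-learning. The sketch below uses the reported hyperparameters (γ = 0.99, α = 0.1 for the Two-Step Task and Grid World); the toy state/action sizes and the environment step are illustrative assumptions, not the paper's tasks.

```python
import numpy as np

# Hyperparameters as reported above; the 5-state, 2-action toy problem is hypothetical.
GAMMA = 0.99
ALPHA = 0.1

def td_update(Q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """One temporal-difference (TD) update on a tabular Q-function.

    The TD error is delta = r + gamma * max_a' Q(s', a') - Q(s, a);
    these deltas and Q-values are the quantities probed in the paper.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Toy usage: a single rewarded transition from state 0 to state 2.
Q = np.zeros((5, 2))
delta = td_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With an all-zero table and reward 1.0, the TD error is 1.0 and the updated entry becomes α × 1.0 = 0.1.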
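The SAE objective in the setup above can be sketched as reconstruction error plus an L1 sparsity penalty with the reported strength β = 1e-05. This is a minimal NumPy forward pass and loss only; the ReLU encoder/decoder form, the weight shapes, and the toy dimensions are assumptions, and the actual training loop (Adam, 30 epochs on 18000 residual-stream representations) is omitted.

```python
import numpy as np

BETA = 1e-05  # L1 regularization strength, as reported

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x into sparse codes z, then reconstruct x."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU yields nonnegative, sparse codes
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def sae_loss(x, x_hat, z, beta=BETA):
    """Mean squared reconstruction error plus L1 penalty on the codes."""
    return np.mean((x - x_hat) ** 2) + beta * np.sum(np.abs(z))

# Toy usage: d_model = 8 residual dimensions, d_hidden = 32 SAE features,
# one batch of 256 activations (the reported batch size).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_enc, b_dec = np.zeros(d_hidden), np.zeros(d_model)

x = rng.normal(size=(256, d_model))
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, z)
```

In practice the hidden dimension is much larger than the model dimension, so the L1 term is what forces most code entries to zero.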