Understanding the learned look-ahead behavior of chess neural networks
Authors: Diogo Cruz
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the look-ahead capabilities of chess-playing neural networks, specifically focusing on the Leela Chess Zero policy network. Our findings reveal that the network's look-ahead behavior is highly context-dependent, varying significantly based on the specific chess position. We demonstrate that the model can process information about board states up to seven moves ahead, utilizing similar internal mechanisms across different future time steps. All experiments were run using an RTX 3070Ti, with a combined runtime of 2 days. |
| Researcher Affiliation | Industry | Diogo Cruz EMAIL Pivotal |
| Pseudocode | No | The paper describes analysis techniques (Activation Patching, Probing, Ablation) but does not present them in a structured pseudocode or algorithm block format. It describes the methodology in narrative text. |
| Open Source Code | Yes | Our implementation is heavily based on the implementation described in Jenner et al. (2024), and previously made available at https://github.com/HumanCompatibleAI/leela-interp. For the activation patching, probing, and zero ablation results, modifications were made to account for the case of more than 3 moves. Code for reproducing our results is available at https://github.com/diogo-cruz/leela-interp. |
| Open Datasets | Yes | We use the Lichess 4-million-puzzle database as a starting point. Each puzzle in our dataset has a starting state with a single winning move for the player whose turn it is, along with an annotated principal variation (the optimal sequence of moves for both players from the starting state). Lichess. Lichess database: Puzzles. https://database.lichess.org/#puzzles, 2025. Data under CC0 1.0; puzzles file last updated 2025-08-02. |
| Dataset Splits | Yes | The puzzles were curated into three datasets: a 22k-puzzle dataset used in Jenner et al. (2024), solvable by the Leela model but difficult for weaker models, used for the 3- and 5-move analyses; a 2.2k dataset of 7-move puzzles; and 609 puzzles for the alternative-move analysis. Additional details on the dataset generation, and their difficulty level, can be found in Appendices F and H. |
| Hardware Specification | Yes | All experiments were run using an RTX 3070Ti, with a combined runtime of 2 days. |
| Software Dependencies | No | The paper mentions using a 'Leela Chess Zero (Leela) policy network' and that 'Our implementation is heavily based on the implementation described in Jenner et al. (2024)'. It also mentions 'Stockfish (depth 22, 8 threads, 2GB hash table, NNUE enabled)' for evaluation. However, specific version numbers for software components like Leela, Python, or machine learning frameworks (e.g., PyTorch, TensorFlow) are not provided. |
| Experiment Setup | Yes | We employ three main techniques to analyze the internal representations of the model: Activation Patching, Probing, and Ablation. For activation patching, we first run the model on the original position to get the clean activations. We then create a corrupted position by replacing specific moves in the game history and run the model on this corrupted position. Let m_c be the correct move, s_p be the patched model state, and s_c be the clean model state. The log odds change L of the target move is then defined as: L = log odds(m_c | s_p) − log odds(m_c | s_c). For probing, we extract activations from each attention head when running the model on chess positions. We then train a bilinear probe to predict the board square associated with the move of interest. For ablation, we selectively set certain activations to zero. The original 3-move dataset was created by starting from the Lichess chess puzzle database and filtering for puzzles where: the weaker model assigned less than a 5% probability to the optimal first move; the Leela model assigned at least a 50% probability to the 1st, 2nd, and 3rd optimal moves; the weaker model assigned more than a 70% probability to the optimal 2nd move. |
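The log-odds-change metric quoted above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are ours, and the move probabilities would in practice come from the Leela policy network's output for the clean and patched runs.

```python
import math

def log_odds(p: float) -> float:
    """Log odds of a probability p, assumed strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def log_odds_change(p_patched: float, p_clean: float) -> float:
    """Change in log odds of the correct move m_c between the patched
    state s_p and the clean state s_c:

        L = log odds(m_c | s_p) - log odds(m_c | s_c)

    A negative L means the patch reduced the model's confidence in the
    correct move."""
    return log_odds(p_patched) - log_odds(p_clean)

# Hypothetical example: corrupting the game history drops the model's
# probability for the correct move from 0.80 to 0.50.
L = log_odds_change(0.50, 0.80)
```

Here `L = log(1) - log(4) ≈ -1.386`, quantifying how much the corruption hurt the correct move's log odds.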