Human-Aligned Chess With a Bit of Search
Authors: Yiming Zhang, Athul Jacob, Vivian Lai, Daniel Fried, Daphne Ippolito
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In offline evaluations, we find that ALLIE exhibits humanlike behavior: it outperforms the existing state-of-the-art in human chess move prediction and ponders at critical positions. The model learns to reliably assign reward at each game state, which can be used at inference as a reward function in a novel time-adaptive Monte-Carlo tree search (MCTS) procedure, where the amount of search depends on how long humans would think in the same positions. Adaptive search enables remarkable skill calibration; in a large-scale online evaluation against players with ratings from 1000 to 2600 Elo, our adaptive search method leads to a skill gap of only 49 Elo on average, substantially outperforming search-free and standard MCTS baselines. |
| Researcher Affiliation | Collaboration | Yiming Zhang (Carnegie Mellon University), Athul Paul Jacob (MIT), Vivian Lai (Visa Research), Daniel Fried (Carnegie Mellon University), Daphne Ippolito (Carnegie Mellon University) |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, data and model weights are available on GitHub. |
| Open Datasets | Yes | We constructed a raw dataset of chess games using all blitz games played in 2022 on Lichess, a popular online chess platform (https://database.lichess.org/). |
| Dataset Splits | Yes | From this downsampled dataset, we use 18 thousand games for testing, and the remaining games for training and validation. In total, the training set contains 91 million games and 6.6 billion tokens. Our primary automatic evaluation metric is move-matching accuracy: how often does the model correctly predict the next move in the game. Following McIlroy-Young et al. (2020), when evaluating accuracy, we discard the first 5 moves of each game, which reduces the impact of opening memorization (there are only so many ways to begin a chess game). We further omit from evaluation any moves made under time pressure (when there is less than 30 seconds on the clock) to avoid the influence of random moves made due to being low on time. This leaves us with 884,049 positions from the evaluation test set. |
| Hardware Specification | Yes | The model is trained for 2M steps with a global batch size of 131,072 tokens on our training set. This corresponds to roughly 40 epochs over the training data. Additional training details and hyperparameters are provided in Appendix E.1. In Appendix F, we explore the effect of both dataset size and parameter count on model capability. We find that our setting is mostly data-constrained: model performance is limited by the number of human chess games available on the Internet, and doubling model size has only a small effect on the model's ability to predict human moves. The model is trained for 2 million steps, which took approximately 2 weeks on 8 NVIDIA A6000 GPUs using bfloat16 precision. |
| Software Dependencies | No | The paper mentions that ALLIE is a GPT-2-style transformer decoder model, but does not specify versions for software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | ALLIE is a GPT-2-style (Brown et al., 2020) transformer decoder model with 355M parameters, trained on a dataset of 6.6 billion tokens. We use a global batch size of 131,072 tokens, a learning rate of 6 x 10^-4, decaying to 1 x 10^-5 using cosine annealing (Loshchilov & Hutter, 2017), and a maximum sequence length of 512 tokens. The model is trained for 2 million steps, which took approximately 2 weeks on 8 NVIDIA A6000 GPUs using bfloat16 precision. |
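The abstract describes a time-adaptive MCTS in which the amount of search depends on how long humans would think in the same position. The paper's exact mapping is not given in the excerpts above; the sketch below is only an illustrative guess at the idea, where a predicted human think time (here a hypothetical input) is converted into a capped simulation budget for an otherwise standard MCTS loop. The constants `sims_per_second` and `max_sims` are assumptions, not values from the paper.

```python
def adaptive_search_budget(predicted_think_seconds, sims_per_second=80, max_sims=400):
    """Map a predicted human think time to an MCTS simulation budget.

    Illustrative only: positions where a human would ponder longer receive
    proportionally more simulations, clamped to [1, max_sims]. The returned
    budget would then bound a standard MCTS selection/expansion/backup loop.
    """
    budget = int(predicted_think_seconds * sims_per_second)
    return max(1, min(max_sims, budget))
```

For example, a position predicted to take half a second of human thought would receive 40 simulations under these assumed constants, while a long ponder saturates at the cap.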
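The evaluation protocol quoted in the Dataset Splits row (discard the first 5 moves of each game; omit moves made with under 30 seconds on the clock) can be sketched as a filter over (position, human move, clock) triples. This is a minimal reconstruction of the stated filtering rules, not the paper's code; the game representation and the interpretation of "moves" as plies are assumptions.

```python
def move_matching_accuracy(games, predict):
    """Fraction of eligible moves where predict(position) matches the human move.

    Each game is a list of (position, human_move, clock_seconds) triples.
    Per the stated protocol, the first 5 moves of each game and any move
    made with less than 30 seconds on the clock are excluded.
    """
    correct = total = 0
    for game in games:
        for i, (position, human_move, clock_seconds) in enumerate(game):
            if i < 5 or clock_seconds < 30:
                continue  # opening move or time-pressure move: excluded
            total += 1
            if predict(position) == human_move:
                correct += 1
    return correct / total if total else 0.0
```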
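The learning-rate schedule in the Experiment Setup row (6 x 10^-4 decaying to 1 x 10^-5 with cosine annealing, per Loshchilov & Hutter, 2017) corresponds to the standard closed form below. The schedule itself is standard; only the assumption that decay spans the full 2M training steps is ours.

```python
import math

def cosine_lr(step, total_steps, lr_max=6e-4, lr_min=1e-5):
    """Cosine-annealed learning rate decaying from lr_max to lr_min.

    Starts at lr_max at step 0, follows half a cosine period, and
    reaches lr_min at total_steps.
    """
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

With these endpoints the schedule passes through (6e-4 + 1e-5) / 2 = 3.05e-4 at the halfway point.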