Intelligence at the Edge of Chaos

Authors: Shiyang Zhang, Aakash Patel, Syed Rizvi, Nianchen Liu, Sizhuang He, Amin Karbasi, Emanuele Zappala, David van Dijk

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 2 presents the model performance across three downstream tasks (easy reasoning, hard reasoning, and chess move prediction) as a function of the complexity of the ECA rules the models were pretrained on. For the reasoning tasks, models generally achieve near-perfect accuracy when trained for a sufficient number of epochs. Therefore, instead of reporting absolute accuracy, we focus on model efficiency, defined as the inverse of the number of epochs required to reach 80% accuracy. The chess task is sufficiently difficult that models do not achieve perfect performance, and so we report the final accuracy.
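The efficiency metric quoted above (inverse of the number of epochs needed to reach 80% accuracy) can be sketched as a small helper; the function name and the return value for models that never reach the threshold are assumptions, since the report only defines the reached case.

```python
def efficiency(accuracy_by_epoch, threshold=0.80):
    """Efficiency = 1 / (first epoch at which accuracy >= threshold).

    `accuracy_by_epoch` is the per-epoch validation accuracy, indexed from
    epoch 1. Returning 0.0 when the threshold is never reached is an
    assumption for illustration, not stated in the paper.
    """
    for epoch, acc in enumerate(accuracy_by_epoch, start=1):
        if acc >= threshold:
            return 1.0 / epoch
    return 0.0
```

For example, a model whose accuracy trace is `[0.50, 0.70, 0.85]` first crosses 80% at epoch 3, giving an efficiency of 1/3.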
Researcher Affiliation | Academia | ¹Yale University, ²Columbia University, ³Northwestern University, ⁴Idaho State University
Pseudocode | No | The paper describes the methodology in prose, primarily in Sections 3 and 4. There are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are there any structured, code-like formatted steps for procedures.
Open Source Code | Yes | The complete code for our paper is now available in the following GitHub repository: https://github.com/vandijklab/Intelligence at the edge of chaos.
Open Datasets | Yes | We use chess games from the Lichess Elite database (Lic), focusing on games played between January and April 2016 by Grandmasters with ratings of 2200 and above. [Reference:] Lichess Elite Database. https://database.nikonoel.fr/. Accessed: 2024-09-30.
Dataset Splits | Yes | We split this collection [Lichess Elite database] into training, validation, and test sets using an 80-10-10 split to facilitate model training and evaluation. The dataset [Nim game] was split into training and validation sets using a 90-10 ratio.
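The 80-10-10 split described above can be sketched as follows; the shuffling, the fixed seed, and the function name are assumptions for illustration, since the report states only the ratios.

```python
import random

def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split n example indices into train/validation/test sets.

    Defaults mirror the 80-10-10 split quoted from the paper; the shuffle
    and seed are assumptions, not details given in the report.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

The Nim dataset's 90-10 train/validation split would use `fractions=(0.9, 0.1, 0.0)` under the same sketch.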
Hardware Specification | Yes | The models were trained on 12 NVIDIA H100 GPUs, each with 80 GB of memory, running on Red Hat Enterprise Linux 8.8.
Software Dependencies | Yes | The experiments were conducted using PyTorch version 2.1.2 and the Transformers library (version 4.41.0), with CUDA version 12.4 for GPU acceleration.
Experiment Setup | Yes | Each model was pretrained on next-token prediction tasks using data generated from a single ECA rule for up to 10,000 epochs. ... The training data were organized into batches of 64 sequences, each comprising 60 time steps and 100 spatial dimensions. We employed the Adam optimizer with an initial learning rate η = 2 × 10⁻⁶ and a weight decay of 0.01. A learning rate scheduler with a linear warm-up over the first 10% of the total steps was implemented to stabilize the initial stages of training and improve convergence rates. After the warm-up phase, we applied cosine annealing to gradually decay the learning rate over the remaining training steps. Gradient accumulation was used to handle larger effective batch sizes within the constraints of GPU memory, allowing us to simulate larger batch sizes by accumulating gradients over multiple mini-batches. To prevent exploding gradients, we applied gradient clipping with a maximum norm of 1.0.
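The learning-rate schedule quoted above (linear warm-up over the first 10% of steps, then cosine annealing) can be sketched as a pure function of the step index; the decay-to-zero floor and the exact warm-up offsets are assumptions, since the report gives only the schedule's shape and the base rate of 2 × 10⁻⁶.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-6, warmup_frac=0.10):
    """Learning rate at a given 0-indexed optimizer step.

    Linear warm-up from ~0 to base_lr over the first `warmup_frac` of
    steps, then cosine annealing toward zero over the remainder. The
    final floor of zero is an assumption, not stated in the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this shape is typically wired in via `torch.optim.lr_scheduler.LambdaLR` with a multiplier function of the same form; gradient clipping to the stated max norm of 1.0 would be applied per step with `torch.nn.utils.clip_grad_norm_`.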