Intelligence at the Edge of Chaos

Authors: Shiyang Zhang, Aakash Patel, Syed Rizvi, Nianchen Liu, Sizhuang He, Amin Karbasi, Emanuele Zappala, David van Dijk

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 2 presents the model performance across three downstream tasks (easy reasoning, hard reasoning, and chess move prediction) as a function of the complexity of the ECA rules the models were pretrained on. For the reasoning tasks, models generally achieve near-perfect accuracy when trained for a sufficient number of epochs. Therefore, instead of reporting absolute accuracy, we focus on model efficiency, defined as the inverse of the number of epochs required to reach 80% accuracy. The chess task is sufficiently difficult that models do not achieve perfect performance, and so we report the final accuracy.
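The efficiency metric quoted above (inverse of the number of epochs needed to reach 80% accuracy) can be sketched as a small helper; the function name and the return value for models that never reach the threshold are assumptions, since the report only defines the reached case.

```python
def efficiency(accuracy_by_epoch, threshold=0.80):
    """Efficiency = 1 / (first epoch at which accuracy >= threshold).

    `accuracy_by_epoch` is the per-epoch validation accuracy, indexed from
    epoch 1. Returning 0.0 when the threshold is never reached is an
    assumption for illustration, not stated in the paper.
    """
    for epoch, acc in enumerate(accuracy_by_epoch, start=1):
        if acc >= threshold:
            return 1.0 / epoch
    return 0.0
```

For example, a model whose accuracy trace is `[0.50, 0.70, 0.85]` first crosses 80% at epoch 3, giving an efficiency of 1/3.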
Researcher Affiliation | Academia | ¹Yale University, ²Columbia University, ³Northwestern University, ⁴Idaho State University
Pseudocode | No | The paper describes the methodology in prose, primarily in Sections 3 and 4. There are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor are there any structured, code-like formatted steps for procedures.
Open Source Code | Yes | The complete code for our paper is now available in the following GitHub repository: https://github.com/vandijklab/Intelligence at the edge of chaos.
Open Datasets | Yes | We use chess games from the Lichess Elite database (Lic), focusing on games played between January and April 2016 by Grandmasters with ratings of 2200 and above. [Reference:] Lichess Elite Database. https://database.nikonoel.fr/. Accessed: 2024-09-30.
Dataset Splits | Yes | We split this collection [Lichess Elite database] into training, validation, and test sets using an 80-10-10 split to facilitate model training and evaluation. The dataset [Nim game] was split into training and validation sets using a 90-10 ratio.
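The 80-10-10 split described above can be sketched as follows; the shuffling, the fixed seed, and the function name are assumptions for illustration, since the report states only the ratios.

```python
import random

def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split n example indices into train/validation/test sets.

    Defaults mirror the 80-10-10 split quoted from the paper; the shuffle
    and seed are assumptions, not details given in the report.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

The Nim dataset's 90-10 train/validation split would use `fractions=(0.9, 0.1, 0.0)` under the same sketch.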
Hardware Specification | Yes | The models were trained on 12 NVIDIA H100 GPUs, each with 80 GB of memory, running on Red Hat Enterprise Linux 8.8.
Software Dependencies | Yes | The experiments were conducted using PyTorch version 2.1.2 and the Transformers library (version 4.41.0), with CUDA version 12.4 for GPU acceleration.
Experiment Setup | Yes | Each model was pretrained on next-token prediction tasks using data generated from a single ECA rule for up to 10,000 epochs. ... The training data were organized into batches of 64 sequences, each comprising 60 time steps and 100 spatial dimensions. We employed the Adam optimizer with an initial learning rate η = 2 × 10⁻⁶ and a weight decay of 0.01. A learning rate scheduler with a linear warm-up over the first 10% of the total steps was implemented to stabilize the initial stages of training and improve convergence rates. After the warm-up phase, we applied cosine annealing to gradually decay the learning rate over the remaining training steps. Gradient accumulation was used to handle larger effective batch sizes within the constraints of GPU memory, allowing us to simulate larger batch sizes by accumulating gradients over multiple mini-batches. To prevent exploding gradients, we applied gradient clipping with a maximum norm of 1.0.
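The learning-rate schedule quoted above (linear warm-up over the first 10% of steps, then cosine annealing) can be sketched as a pure function of the step index; the decay-to-zero floor and the exact warm-up offsets are assumptions, since the report gives only the schedule's shape and the base rate of 2 × 10⁻⁶.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-6, warmup_frac=0.10):
    """Learning rate at a given 0-indexed optimizer step.

    Linear warm-up from ~0 to base_lr over the first `warmup_frac` of
    steps, then cosine annealing toward zero over the remainder. The
    final floor of zero is an assumption, not stated in the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this shape is typically wired in via `torch.optim.lr_scheduler.LambdaLR` with a multiplier function of the same form; gradient clipping to the stated max norm of 1.0 would be applied per step with `torch.nn.utils.clip_grad_norm_`.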