Designing Skill-Compatible AI: Methodologies and Frameworks in Chess

Authors: Karim Hamade, Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, Ashton Anderson

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility.
Researcher Affiliation | Collaboration | Karim Hamade (University of Toronto), Reid McIlroy-Young (University of Toronto), Siddhartha Sen (Microsoft Research), Jon Kleinberg (Cornell University), Ashton Anderson (University of Toronto)
Pseudocode | No | The paper does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured steps in a code-like format.
Open Source Code | Yes | Our code is released at github.com/CSSLab/skill-compatibility-chess. We also include several of our trained models.
Open Datasets | No | The paper states that maia was trained on games from lichess.org, an open-source platform. However, it does not provide a direct link, DOI, or specific repository for the dataset used, nor a formal citation of the dataset itself; only the platform source is named.
Dataset Splits | Yes | To create att, a dataset of 10,000 games (80% train, 10% validate, and 10% test) is generated from games of the move sequence leela–maia–leela–maia for STT, or leela–maia–leela–maia for HB.
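The 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the placeholder game IDs and the fixed seed are assumptions for reproducibility of the sketch.

```python
import random

def split_games(games, seed=0):
    """Shuffle and split a list of games into 80% train, 10% validate, 10% test."""
    games = list(games)
    random.Random(seed).shuffle(games)
    n = len(games)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return (games[:n_train],
            games[n_train:n_train + n_val],
            games[n_train + n_val:])

# 10,000 placeholder game IDs standing in for the generated leela/maia games
train, val, test = split_games(range(10000))
print(len(train), len(val), len(test))  # 8000 1000 1000
```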
Hardware Specification | Yes | We made use of four Tesla K80 GPUs for the purpose of experimentation, each with 12 GB of VRAM.
Software Dependencies | Yes | Against stockfish 13 (60k nodes), a strong classical engine that uses alpha-beta search, this version of leela obtains a score of 59±3.
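A match score like the 59±3 quoted above is conventionally the percentage of points scored, with a win worth 1 point and a draw worth 0.5. A small sketch of that convention, using hypothetical win/draw/loss tallies rather than figures from the paper:

```python
def match_score(wins, draws, losses):
    """Percentage match score: a win counts 1 point, a draw 0.5, a loss 0."""
    games = wins + draws + losses
    return 100.0 * (wins + 0.5 * draws) / games

# Hypothetical tallies over a 100-game match (illustrative only)
print(match_score(45, 28, 27))  # 59.0
```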
Experiment Setup | Yes | To create att, a dataset of 10,000 games (80% train, 10% validate, and 10% test) is generated from games of the move sequence leela–maia–leela–maia for STT, or leela–maia–leela–maia for HB. Then, starting with leela's weights, and using a learning rate of 1e-5 and 10,000 iterations, we run back-propagation to update leela's policy and value neural network.
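The fine-tuning recipe in that row (start from pretrained weights, 10,000 gradient steps at learning rate 1e-5) can be sketched as a toy loop. A scalar quadratic loss here stands in for leela's policy/value networks; the function names and objective are illustrative assumptions, not the authors' implementation.

```python
LEARNING_RATE = 1e-5   # learning rate quoted in the paper
ITERATIONS = 10_000    # iteration count quoted in the paper

def finetune(w_init, target):
    """Gradient descent on loss(w) = (w - target)^2, starting from pretrained w_init."""
    w = w_init
    for _ in range(ITERATIONS):
        grad = 2.0 * (w - target)   # d/dw of (w - target)^2
        w -= LEARNING_RATE * grad
    return w

# Weights drift from their pretrained value toward the fine-tuning objective
w = finetune(w_init=1.0, target=0.0)
print(0.0 < w < 1.0)  # True
```

With such a small learning rate the weights move only partway toward the target in 10,000 steps, mirroring why fine-tuning preserves much of the pretrained behavior.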