reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Ad-Hoc Human-AI Coordination Challenge

Authors: Tin Dizdarević, Ravi Hammond, Tobias Gessler, Anisoara Calinescu, Jonathan Cook, Matteo Gallici, Andrei Lupu, Jakob Nicolaus Foerster

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We develop human proxy agents on a large-scale human dataset that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games... We present baseline results for both two- and three-player Hanabi scenarios... Our empirical evaluation shows that these human proxy agents outperform pure imitation learning while maintaining human-like behaviour.
Researcher Affiliation	Academia	1 FLAIR, University of Oxford, Oxford, UK; 2 Department of Computer Science, University of Oxford, UK; 3 Universitat Politècnica de Catalunya, Barcelona, Spain.
Pseudocode	No	The paper describes methods like HDR-IPPO, IPPO, and BC using mathematical formulations and textual explanations, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code	Yes	The code is available at https://github.com/FLAIROx/ah2ac2. We provide our code in an anonymous repository: https://anonymous.4open.science/r/ah2ac2-2FDA. At the time of submission, the open-sourced codebase includes: ... Code for training the baselines.
Open Datasets	Yes	To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. The first open-source Hanabi human gameplay dataset, containing 1,858 two-player and 1,221 three-player games.
Dataset Splits	Yes	Specifically, for the 2p setting, we select 858 games for validation and 858 for testing; for the 3p setting, we allocate 221 games for validation and 221 for testing.
Hardware Specification	No	The paper does not provide specific GPU/CPU models, processor types, memory details, or any other explicit hardware specifications used for running its experiments.
Software Dependencies	No	The paper mentions software components and methods such as the Adam optimiser, PPO, and GRU, but it does not provide specific version numbers for any key software libraries, frameworks, or environments used for replication.
Experiment Setup	Yes	Table 9. Human proxy agent training configurations and architectures. We showcase both BC and IPPO hyperparameters in a single table. Table 12. Hyperparameters used for training agents in the ablation study. Table 15. Hyperparameters used for training BC, HDR-IPPO baselines on a 1,000-game data limit challenge. Table 17. Hyperparameters used for training all IPPO and BR-BC baseline agents.