A Generalist Hanabi Agent
Authors: Arjun V Sudhakar, Hadi Nekoei, Mathieu Reymond, Miao Liu, Janarthanan Rajendran, Sarath Chandar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we analyse the generalization performance of R3D2 on Hanabi. Hanabi is a challenging task, which requires learning conventions with other players to reach a high score. When changing the number of players participating in a game, the strategy to reach a high score changes, even though the rules of the game remain the same. [...] We assess the benefits of playing on more diverse types of gameplay, and how this improves the agent's cooperation abilities with others. 5.1 EXPERIMENTAL SETUP We train single-setting R3D2 agents for each setting, i.e., from 2-player games to 5-player games. [...] A.2 ABLATION STUDIES ON THE ROLE OF LANGUAGE MODELING To better understand the impact of different components in R3D2, we conduct a series of ablation studies examining the role of language model pre-training, update frequency, and architectural choices. |
| Researcher Affiliation | Collaboration | Arjun Vaithilingam Sudhakar 1,2,3; Hadi Nekoei 1,2,4; Mathieu Reymond 1,2,3; Janarthanan Rajendran 6; Miao Liu 5; Sarath Chandar 1,2,3,7. Affiliations: 1 Chandar Research Lab; 2 Mila Quebec AI Institute; 3 Polytechnique Montréal; 4 Université de Montréal; 5 IBM Research; 6 Dalhousie University; 7 Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes the methodology and algorithms in descriptive text and figures (Figure 2 shows an overview of the R3D2 architecture), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The implementation code is available at: R3D2-A-Generalist-Hanabi-Agent |
| Open Datasets | No | The dataset is acquired through self-play mode, utilizing a pre-trained OBL agent in the Hanabi game. Trajectories are filtered selectively with a gameplay score exceeding 20. Then, these trajectories are broken down into state-action pairs to suit language model training. During the initial data exploration, we found the action categories are imbalanced as shown in 10; hence the language model overfits to the "discard 4" action, based on the confusion matrix for the predictions. To avoid that, we performed categorical sampling consisting of 2,200 samples per action type, aggregating to 44,000 instances. |
| Dataset Splits | Yes | The dataset is acquired through self-play mode, utilizing a pre-trained OBL agent in the Hanabi game. [...] After which, 10% of the dataset is reserved for testing by random sampling. Further, the dataset is split into 90% for train and 10% for validation. |
| Hardware Specification | No | The authors acknowledge the computational resources provided by Mila and the Digital Research Alliance of Canada. This statement is general and does not specify particular GPU, CPU models, or memory configurations used for the experiments. |
| Software Dependencies | No | The code was implemented using PyTorch, and pre-trained language models were loaded using Huggingface. To gain insights for this paper, we employed Weights & Biases (Biewald, 2020) for experiment tracking and visualizations. Lastly, plots are created using the seaborn package. For RL algorithms, we used the OBL agent (Hu et al., 2021c) to collect the expert trajectories and forked the official instruct-rl codebase to train the algorithm. |
| Experiment Setup | Yes | Table 1: Hyper-parameters for R3D2 agents. Replay buffer: burn-in frames 10,000; replay buffer size 50,000; priority exponent 0.9; priority weight 0.6; max trajectory length 80. Optimization: optimizer Adam; lr 6.25e-05; eps 1.5e-05; grad clip 5; batch size 64. Q-learning: n-step 1 (R3D2); discount factor 0.999; target network sync interval 2500; exploration rates ϵ_0, ..., ϵ_n, where ϵ_i = 0.1^(1 + 7i/(n−1)), n = 80. |
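The exploration entry in the hyper-parameter row follows the per-actor schedule popularized by Ape-X/R2D2. A minimal sketch, assuming the formula (garbled in the extracted text) reads ϵ_i = 0.1^(1 + 7i/(n−1)); the base 0.1, exponent slope 7, and n = 80 are taken from the table, while the function name is ours:

```python
def epsilon_schedule(n: int = 80, base: float = 0.1, alpha: float = 7.0) -> list[float]:
    """Per-actor exploration rates eps_i = base ** (1 + alpha * i / (n - 1)).

    Assumes the Ape-X/R2D2-style reading of the table's formula:
    eps_i = 0.1^(1 + 7i/(n-1)) with n = 80 parallel actors.
    """
    return [base ** (1.0 + alpha * i / (n - 1)) for i in range(n)]

eps = epsilon_schedule()
# The most exploratory actor uses eps_0 = 0.1; the greediest actor's rate
# decays toward 0.1**8, so exploration varies over seven orders of magnitude.
```

Under this reading, every actor explores at a fixed but different rate, which is what lets a distributed replay buffer mix near-greedy and highly exploratory trajectories.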
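The data-preparation steps quoted in the Open Datasets and Dataset Splits rows (balanced sampling of 2,200 state-action pairs per action type, a 10% held-out test set, then a 90/10 train/validation split) can be sketched as follows. The `records` structure and its `"action"` key are hypothetical stand-ins for the OBL trajectory data, which the paper does not specify:

```python
import random
from collections import defaultdict

def balance_and_split(records, per_action=2200, test_frac=0.10, val_frac=0.10, seed=0):
    """Balanced per-action sampling, then test / train / validation splits.

    `records` is assumed to be a list of dicts with an "action" key; the
    per-action count (2,200), 10% test fraction, and 90/10 train/val split
    follow the description quoted in the report. Returns (train, val, test).
    """
    rng = random.Random(seed)
    by_action = defaultdict(list)
    for r in records:
        by_action[r["action"]].append(r)

    # Categorical sampling: the same number of examples for every action
    # type, so the language model cannot overfit to frequent actions.
    balanced = []
    for samples in by_action.values():
        balanced.extend(rng.sample(samples, min(per_action, len(samples))))
    rng.shuffle(balanced)

    # 10% reserved for testing, then 90/10 train/validation on the rest.
    n_test = int(len(balanced) * test_frac)
    test, rest = balanced[:n_test], balanced[n_test:]
    n_val = int(len(rest) * val_frac)
    return rest[n_val:], rest[:n_val], test
```

With 20 action types this yields the 44,000 instances mentioned in the quote; splitting the test set off before the train/validation split matches the order in which the paper describes the two operations.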