A Generalist Hanabi Agent
Authors: Arjun V Sudhakar, Hadi Nekoei, Mathieu Reymond, Miao Liu, Janarthanan Rajendran, Sarath Chandar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we analyse the generalization performance of R3D2 on Hanabi. Hanabi is a challenging task, which requires learning conventions with other players to reach a high score. When changing the number of players participating in a game, the strategy to reach a high score changes, even though the rules of the game remain the same. [...] We assess the benefits of playing on more diverse types of gameplay, and how this improves the agent's cooperation abilities with others. 5.1 EXPERIMENTAL SETUP We train single-setting R3D2 agents for each setting, i.e., from 2-player games to 5-player games. [...] A.2 ABLATION STUDIES ON THE ROLE OF LANGUAGE MODELING To better understand the impact of different components in R3D2, we conduct a series of ablation studies examining the role of language model pre-training, update frequency, and architectural choices. |
| Researcher Affiliation | Collaboration | Arjun Vaithilingam Sudhakar 1,2,3; Hadi Nekoei 1,2,4; Mathieu Reymond 1,2,3; Janarthanan Rajendran 6; Miao Liu 5; Sarath Chandar 1,2,3,7. Affiliations: 1 Chandar Research Lab; 2 Mila Quebec AI Institute; 3 Polytechnique Montréal; 4 Université de Montréal; 5 IBM Research; 6 Dalhousie University; 7 Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes the methodology and algorithms in descriptive text and figures (Figure 2 shows an overview of the R3D2 architecture), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The implementation code is available at: R3D2-A-Generalist-Hanabi-Agent |
| Open Datasets | No | The dataset is acquired through self-play mode, utilizing a pre-trained OBL agent in the Hanabi game. Trajectories are filtered selectively with a gameplay score exceeding 20. Then, these trajectories are broken down into state-action pairs to suit language model training. During the initial data exploration, we found the action categories are imbalanced as shown in 10; hence the language model overfits to the "discard 4" action, based on the confusion matrix for the predictions. To avoid that, we performed categorical sampling consisting of 2,200 samples per action type, aggregating to 44,000 instances. |
| Dataset Splits | Yes | The dataset is acquired through self-play mode, utilizing a pre-trained OBL agent in the Hanabi game. [...] After which, 10% of the dataset is reserved for testing by random sampling. Further, the dataset is split into 90% for train and 10% for validation. |
| Hardware Specification | No | The authors acknowledge the computational resources provided by Mila and the Digital Research Alliance of Canada. This statement is general and does not specify particular GPU, CPU models, or memory configurations used for the experiments. |
| Software Dependencies | No | The code was implemented using PyTorch, and pre-trained language models were loaded using Huggingface. To gain insights for this paper, we employed Weights & Biases (Biewald, 2020) for experiment tracking and visualizations. Lastly, plots are created using the seaborn package. For RL algorithms, we used the OBL agent (Hu et al., 2021c) to collect the expert trajectories and forked the official instruct-rl codebase to train the algorithm. |
| Experiment Setup | Yes | Table 1: Hyper-parameters for R3D2 agents. Replay buffer: burn-in frames 10,000; replay buffer size 50,000; priority exponent 0.9; priority weight 0.6; max trajectory length 80. Optimization: optimizer Adam; lr 6.25e-05; eps 1.5e-05; grad clip 5; batch size 64. Q-learning: n-step 1 (R3D2); discount factor 0.999; target network sync interval 2500; exploration rates ϵ_0, ..., ϵ_n, where ϵ_i = 0.1^(1 + 7i/(n−1)), n = 80. |
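The exploration entry in the hyper-parameter row follows the per-actor schedule popularized by Ape-X/R2D2. A minimal sketch, assuming the formula (garbled in the extracted text) reads ϵ_i = 0.1^(1 + 7i/(n−1)); the base 0.1, exponent slope 7, and n = 80 are taken from the table, while the function name is ours:

```python
def epsilon_schedule(n: int = 80, base: float = 0.1, alpha: float = 7.0) -> list[float]:
    """Per-actor exploration rates eps_i = base ** (1 + alpha * i / (n - 1)).

    Assumes the Ape-X/R2D2-style reading of the table's formula:
    eps_i = 0.1^(1 + 7i/(n-1)) with n = 80 parallel actors.
    """
    return [base ** (1.0 + alpha * i / (n - 1)) for i in range(n)]

eps = epsilon_schedule()
# The most exploratory actor uses eps_0 = 0.1; the greediest actor's rate
# decays toward 0.1**8, so exploration varies over seven orders of magnitude.
```

Under this reading, every actor explores at a fixed but different rate, which is what lets a distributed replay buffer mix near-greedy and highly exploratory trajectories.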
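The data-preparation steps quoted in the Open Datasets and Dataset Splits rows (balanced sampling of 2,200 state-action pairs per action type, a 10% held-out test set, then a 90/10 train/validation split) can be sketched as follows. The `records` structure and its `"action"` key are hypothetical stand-ins for the OBL trajectory data, which the paper does not specify:

```python
import random
from collections import defaultdict

def balance_and_split(records, per_action=2200, test_frac=0.10, val_frac=0.10, seed=0):
    """Balanced per-action sampling, then test / train / validation splits.

    `records` is assumed to be a list of dicts with an "action" key; the
    per-action count (2,200), 10% test fraction, and 90/10 train/val split
    follow the description quoted in the report. Returns (train, val, test).
    """
    rng = random.Random(seed)
    by_action = defaultdict(list)
    for r in records:
        by_action[r["action"]].append(r)

    # Categorical sampling: the same number of examples for every action
    # type, so the language model cannot overfit to frequent actions.
    balanced = []
    for samples in by_action.values():
        balanced.extend(rng.sample(samples, min(per_action, len(samples))))
    rng.shuffle(balanced)

    # 10% reserved for testing, then 90/10 train/validation on the rest.
    n_test = int(len(balanced) * test_frac)
    test, rest = balanced[:n_test], balanced[n_test:]
    n_val = int(len(rest) * val_frac)
    return rest[n_val:], rest[:n_val], test
```

With 20 action types this yields the 44,000 instances mentioned in the quote; splitting the test set off before the train/validation split matches the order in which the paper describes the two operations.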