Towards a General Transfer Approach for Policy-Value Networks

Authors: Dennis J. N. J. Soemers, Vegard Mella, Eric Piette, Matthew Stephenson, Cameron Browne, Olivier Teytaud

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Extensive results, including several cases of highly successful zero-shot transfer, are provided for a wide variety of source and target games. This section discusses experiments used to evaluate the performance of fully convolutional architectures, as well as transfer learning between variants of games and between distinct games.
Researcher Affiliation: Collaboration. Dennis J.N.J. Soemers, Department of Advanced Computing Sciences, Maastricht University; Vegard Mella, Meta AI Research; Éric Piette, ICTEAM, UCLouvain; Matthew Stephenson, College of Science and Engineering, Flinders University; Cameron Browne, Department of Advanced Computing Sciences, Maastricht University; Olivier Teytaud, Meta AI Research.
Pseudocode: No. The paper describes the transfer approach and network architecture in text and diagrams (Figure 5 in Appendix A is an architecture diagram), but does not contain a dedicated pseudocode or algorithm block.
Open Source Code: Yes. Source code used to transfer weights: https://github.com/DennisSoemers/Transfer-DNNs-Ludii-Polygames.
Open Datasets: Yes. As an example, we use the Ludii general game system, which includes a highly varied set of over 1000 distinct games described in such a language. Ludii (Browne et al., 2020; Piette et al., 2020) is a general game system with over 1000 built-in games, many of which support multiple variants with different board sizes, board shapes, rulesets, etc. It provides suitable object-oriented state and action representations for any game described in its game description language, and these can be converted into tensor representations in a consistent manner with no need for additional game-specific engineering effort (Soemers et al., 2022).
Dataset Splits: Yes. We evaluate zero-shot transfer performance for a source domain S and target domain T by reporting the win percentage of the model trained in S against the model trained in T, over 300 evaluation games per (S, T) pair played in T. In each set of 300 evaluation games, each agent plays as the first player in 150 games and as the second player in the other 150.
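The seat-balanced evaluation protocol described above can be sketched as follows. This is a minimal illustration only: the `play_game` callback and the convention of scoring a draw as 0.5 are assumptions, not details taken from the paper.

```python
def win_percentage(play_game, num_games=300):
    """Win percentage of the source-domain model over `num_games`
    evaluation games in the target domain, with seats swapped halfway.

    `play_game(source_is_first)` is a hypothetical callback that plays
    one game and returns 1.0 (source-model win), 0.5 (draw, an assumed
    convention), or 0.0 (loss).
    """
    score = 0.0
    for g in range(num_games):
        source_is_first = g < num_games // 2  # first half as player 1
        score += play_game(source_is_first)
    return 100.0 * score / num_games

# Dummy callback: the source model wins exactly when it moves first,
# giving a 50% overall win rate across the balanced 150/150 split.
print(win_percentage(lambda source_is_first: 1.0 if source_is_first else 0.0))  # → 50.0
```

Balancing seats this way removes any first-move advantage from the reported win percentages.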
Hardware Specification: No. Models of various sizes, as measured by the number of trainable parameters, were constructed by randomly drawing choices for hyperparameters such as the number of layers, blocks, and channels for hidden layers. After training, we evaluated the performance of every model by recording the win percentage of an MCTS agent using 40 iterations per move with the model, versus a standard untrained UCT (Browne et al., 2012) agent with 800 iterations per move. The untrained UCT backs up the average outcome of ten random rollouts from the node it traverses to in each iteration. These win percentages are depicted in Figure 1. In the majority of cases, a fully convolutional model with global pooling (ResConvConvLogitPoolModel) is among the strongest architectures. Fully convolutional models generally outperform those with dense layers, and models with global pooling generally outperform those without. This suggests that using such architectures can be beneficial in and of itself, and that their use to facilitate transfer learning does not sacrifice baseline performance. Therefore, all transfer learning experiments discussed below used the ResConvConvLogitPoolModel architecture from Polygames. All models were trained for 20 hours on 8 GPUs and 80 CPU cores, using 1 server for training and 7 clients for the generation of self-play games.
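The random-rollout value estimate used by the untrained UCT baseline can be sketched as below. The `CoinFlipState` toy game and its method names are illustrative assumptions; a real implementation would operate on actual game states inside an MCTS tree.

```python
import random

class CoinFlipState:
    """Toy game standing in for a real game-state API (hypothetical)."""
    def __init__(self, flips_left=3, score=0):
        self.flips_left = flips_left
        self.score = score
    def copy(self):
        return CoinFlipState(self.flips_left, self.score)
    def is_terminal(self):
        return self.flips_left == 0
    def legal_moves(self):
        return [0, 1]
    def apply(self, move):
        self.flips_left -= 1
        self.score += move
    def outcome(self):
        # +1 for a "win", -1 for a "loss" in this toy game.
        return 1.0 if self.score >= 2 else -1.0

def rollout_value(state, num_rollouts=10):
    """Back up the average outcome of `num_rollouts` uniformly random
    rollouts from a leaf, as the untrained UCT baseline does."""
    total = 0.0
    for _ in range(num_rollouts):
        s = state.copy()
        while not s.is_terminal():
            s.apply(random.choice(s.legal_moves()))
        total += s.outcome()
    return total / num_rollouts
```

Averaging several rollouts per leaf reduces the variance of the value estimate compared to backing up a single random playout.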
Software Dependencies: Yes. We used the training code from Polygames. For transfer learning experiments, we used games as implemented in Ludii v1.1.6. Every game used is a two-player zero-sum game. All games may be considered sparse-reward problems in the sense that they only have potential non-zero rewards when terminal game states are reached, and no reward shaping is used. Some of the games, such as Breakthrough and Hex, naturally progress towards terminal game states with non-zero rewards regardless of player strength (even under random play), whereas others also have terminal game states with outcomes equal to 0 for both players (e.g., Diagonal Hex). Appendix E provides details on hyperparameter values used for training throughout all experiments.
Experiment Setup: Yes. For all training runs for transfer learning experiments, the following command-line arguments were supplied to the train command of Polygames (Cazenave et al., 2020):
- --num_game 2: Affects the number of threads used to run games per self-play client process.
- --epoch_len 256: Number of training batches per epoch.
- --batchsize 128: Batch size for model training.
- --sync_period 32: Affects how often models are synced.
- --num_rollouts 400: Number of MCTS iterations per move during self-play training.
- --replay_capacity 100000: Capacity of the replay buffer.
- --replay_warmup 9000: Minimum size of the replay buffer before training starts.
- --model_name "ResConvConvLogitPoolModelV2": Type of architecture to use (a fully convolutional architecture with global pooling).
- --bn: Use of batch normalization (Ioffe & Szegedy, 2015).
- --nnsize 2: A value of 2 means that hidden convolutional layers each have twice as many channels as the number of channels for the state input tensors.
- --nb_layers_per_net 6: Number of convolutional layers per residual block.
- --nb_nets 10: Number of residual blocks.
- --tournament_mode=true: Use the tournament mode of Polygames to select checkpoints to play against in self-play.
- --bsfinder_max_bs=800: Upper bound on the number of neural network queries batched together during inference (a lower value of 400 was used to reduce memory usage in Breakthrough, Hasami Shogi, Kyoto Shogi, Minishogi, Shogi, and Tobi Shogi).
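Assembled into a single invocation, the flags above would look roughly like the following. This is a sketch only: the `pypolygames` module path is an assumption, and game-selection arguments (which vary per experiment) are omitted; consult the Polygames repository for the exact interface.

```shell
# Sketch of the Polygames training invocation using the flags listed above.
# `pypolygames` entry point assumed; game-selection flags omitted.
python -m pypolygames train \
  --num_game 2 \
  --epoch_len 256 \
  --batchsize 128 \
  --sync_period 32 \
  --num_rollouts 400 \
  --replay_capacity 100000 \
  --replay_warmup 9000 \
  --model_name "ResConvConvLogitPoolModelV2" \
  --bn \
  --nnsize 2 \
  --nb_layers_per_net 6 \
  --nb_nets 10 \
  --tournament_mode=true \
  --bsfinder_max_bs=800
```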