Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games
Authors: Chiu-Chou Lin, Wei-Chen Chiu, I-Chen Wu
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across two racing games and seven Atari games, our techniques significantly improve the precision of zero-shot playstyle classification, achieving an accuracy exceeding 90% with fewer than 512 observation-action pairs, which corresponds to less than half an episode of these games. Furthermore, our experiments with 2048 and Go demonstrate the potential of discrete playstyle measures in puzzle and board games. |
| Researcher Affiliation | Academia | Chiu-Chou Lin EMAIL Department of Computer Science National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan Wei-Chen Chiu EMAIL Department of Computer Science National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan I-Chen Wu EMAIL Department of Computer Science National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan Research Center for Information Technology Innovation Academia Sinica, Taipei 11529, Taiwan |
| Pseudocode | Yes | Algorithm 1 Measuring Policy Diversity Input: Policy π, Environment E, Similarity measure M Input: Similarity threshold t, Number of trajectories N 1: Initialize S (store trajectories) and diverse trajectory count d = 0 2: for i = 1 to N do 3: Generate a trajectory τi ~ (π, E) 4: Set is_diverse = true 5: for each τj in S do 6: Compute similarity M(τi, τj) 7: if M(τi, τj) ≥ t then 8: is_diverse = false 9: break 10: end if 11: end for 12: if is_diverse then 13: d = d + 1 14: end if 15: Store τi in S 16: end for Output: Return d (diverse trajectory count) and N (total trajectories) |
| Open Source Code | No | It is crucial to clarify that our research did not involve the training of new encoder models. Instead, we leveraged three pretrained encoder models and corresponding datasets for each game, provided by Lin et al. (2021). The associated resources are available in their official release (https://paperswithcode.com/paper/an-unsupervised-video-game-playstyle-metric). The game details are listed in Table 1. |
| Open Datasets | Yes | Our study encompasses three distinct game platforms, as depicted in Figures 3a, 3b, and 3c: 1. TORCS: This racing game features stable, controlled rule-based AI players (Yoshida et al., 2017). 2. RGSK (Racing Game Starter Kit): This racing game is available on the Unity Asset Store (Juliani et al., 2020)... 3. Atari games with DRL agents: The dataset spans 7 different Atari games (Bellemare et al., 2013) from this platform. Each game includes 20 AI models, all of which demonstrate varied playstyles. These AI models originate from the DRL framework Dopamine (Castro et al., 2018). ... The Go dataset used in this study was sourced from Fox Go (Fox Go, 2024a;b) and provided by the team of the Mini Zero framework (Wu et al., 2024). |
| Dataset Splits | Yes | Our playstyle classification adheres to the zero-shot methodology. As depicted in Figure 3d, we start with a query dataset N, sampled from a playstyle Style_n. We then compare this to multiple reference datasets M, each sampled from a different playstyle Style_m. We perform 100 rounds of random subsampling for each playstyle; our primary performance metric for this task is the accuracy of playstyle classification. ... For each player, we collected 1000 episodes, using the first 500 as the reference dataset and the remaining 500 as separate query datasets. This resulted in a total of 5000 query datasets for the experiment. ... Another dataset includes 200 human players with Go skill ranging from 1 Dan to 9 Dan, each contributing 100 games to the query datasets and 100 games to the candidate datasets. |
| Hardware Specification | No | No specific hardware details (like CPU/GPU models, memory, or cloud instances) are mentioned for running the experiments. The paper discusses training DRL agents and encoder models but does not specify the hardware used for these processes or for the evaluations. |
| Software Dependencies | No | The paper mentions using 'training code available on GitHub' for 2048 agents and the 'Mini Zero framework' for the Go dataset, as well as the 'Adam optimizer'. However, no specific version numbers for these or any other software dependencies (such as Python, PyTorch, TensorFlow, or CUDA) are provided, which are necessary for reproducibility. |
| Experiment Setup | Yes | We set the learning rate (α) to 0.01 and maintained all other default settings. ... We train the encoder with a batch size of 1024 over 100 iterations, each iteration including 1000 network updates with the Adam optimizer. The learning rate starts at 0.00025 and linearly decays to 0 according to the iteration number. The coefficient β in the vector quantization process is set to the commonly suggested value of 0.25 (van den Oord et al., 2017; Lin et al., 2021). The loss function for the policy head is cross-entropy, and the loss for the value head is mean squared error, with the loss coefficients of these two heads both set to 1. |
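The pseudocode quoted in the Pseudocode row above (Algorithm 1, Measuring Policy Diversity) translates almost directly into code. The sketch below is a minimal Python rendering, not the authors' implementation: the trajectory generator and similarity measure are left as caller-supplied callables, since the paper's similarity M comes from pretrained encoders that are not reproduced here.

```python
def count_diverse_trajectories(generate, similarity, threshold, n):
    """Algorithm 1 sketch: a trajectory counts as diverse if its
    similarity to every previously stored trajectory is below the
    threshold t; it is then stored either way."""
    stored, diverse = [], 0
    for _ in range(n):
        traj = generate()
        # The first trajectory is trivially diverse (nothing stored yet).
        if all(similarity(traj, prev) < threshold for prev in stored):
            diverse += 1
        stored.append(traj)
    return diverse, n

# Toy usage: trajectories as scalars, similarity = 1 - distance.
trajs = iter([0.0, 0.05, 1.0, 1.02, 2.0])
sim = lambda a, b: 1.0 - abs(a - b)
print(count_diverse_trajectories(lambda: next(trajs), sim, threshold=0.9, n=5))
# prints (3, 5): 0.05 and 1.02 are too similar to earlier trajectories
```

Returning both d and N, as in the algorithm's output line, lets the caller report diversity as a ratio d/N.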
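The zero-shot classification protocol in the Dataset Splits row (query dataset vs. multiple reference datasets, 100 rounds of random subsampling, accuracy as the metric) can be sketched as follows. This is an illustrative reconstruction under assumptions: `datasets` mapping style names to lists of samples, and a caller-supplied `similarity` standing in for the paper's perceptual similarity measure, are both hypothetical names.

```python
import random

def classify_playstyle(query, references, similarity):
    """Zero-shot assignment: pick the reference playstyle whose
    dataset is most similar to the query dataset (no classifier training)."""
    return max(references, key=lambda style: similarity(query, references[style]))

def classification_accuracy(datasets, similarity, sample_size, rounds=100, seed=0):
    """Accuracy over repeated random subsampling, mirroring the
    100-round protocol described in the paper."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(rounds):
        true_style = rng.choice(sorted(datasets))
        query = rng.sample(datasets[true_style], sample_size)
        refs = {s: rng.sample(datasets[s], sample_size) for s in datasets}
        correct += classify_playstyle(query, refs, similarity) == true_style
    return correct / rounds

# Toy usage: two well-separated styles, similarity = negative mean distance.
data = {"aggressive": [0.0] * 20, "cautious": [10.0] * 20}
mean_sim = lambda q, r: -abs(sum(q) / len(q) - sum(r) / len(r))
print(classification_accuracy(data, mean_sim, sample_size=5))
# prints 1.0: the styles are trivially separable in this toy setting
```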
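The Experiment Setup row specifies a learning rate that starts at 0.00025 and decays linearly to 0 over 100 iterations, and a total loss combining policy cross-entropy and value mean squared error with coefficients of 1. A minimal sketch of both pieces (function names are my own, not from the paper):

```python
def linear_decay_lr(iteration, total_iterations=100, base_lr=0.00025):
    """Linear schedule from the setup: 2.5e-4 at iteration 0,
    decaying to 0 at the final iteration."""
    return base_lr * (1.0 - iteration / total_iterations)

def combined_loss(policy_ce, value_mse, policy_coef=1.0, value_coef=1.0):
    """Total loss: cross-entropy for the policy head plus mean squared
    error for the value head, both weighted by 1 as described."""
    return policy_coef * policy_ce + value_coef * value_mse

print(linear_decay_lr(0))    # 0.00025
print(linear_decay_lr(100))  # 0.0
```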