PokéChamp: an Expert-level Minimax Language Agent
Authors: Seth Karten, Andy Luu Nguyen, Chi Jin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Notably, our framework requires no additional LLM training. |
| Researcher Affiliation | Academia | Seth Karten * 1 Andy Luu Nguyen 1 Chi Jin 1 1Princeton University. Correspondence to: Seth Karten <EMAIL>. |
| Pseudocode | No | The paper describes the minimax tree search framework and illustrates it with Figure 4, but does not provide a formal pseudocode block or algorithm steps. |
| Open Source Code | Yes | Videos, code, and dataset are available online. |
| Open Datasets | Yes | This work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. This work establishes Pokémon as a benchmark to integrate LLM technologies with game-theoretic algorithms addressing general multi-agent problems. Videos, code, and dataset are available online. |
| Dataset Splits | No | The paper mentions compiling a dataset of over 3 million Pokémon battles and using replay data for action prediction, but it does not specify how this dataset was split into training, validation, or test sets for their experiments. |
| Hardware Specification | No | The paper mentions using GPT-4o and Llama 3.1:8b as language models, but it does not specify any hardware details like GPU models, CPU models, or memory used for running the experiments. |
| Software Dependencies | Yes | The LLM agents utilize either Llama3.1:8b (Dubey et al., 2024) or GPT-4o-2024-05-13 (Achiam et al., 2023). |
| Experiment Setup | Yes | We evaluate PokéChamp in the popular Gen 9 OU format. Each experiment consists of at least 25 matches between any two methods, resulting in a minimum of 100 games per method for Elo calculations. The LLM agents utilize either Llama3.1:8b (Dubey et al., 2024) or GPT-4o-2024-05-13 (Achiam et al., 2023). Each player has a total clock time of 150 seconds for the entire match. In addition, each turn has an incremental time limit of 15 seconds. If a player exceeds either the total clock time or the incremental turn time, they automatically lose the match. |
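The time-control rules in the experiment setup (a 150-second total clock per player, a 15-second per-turn limit, and automatic forfeit on exceeding either) can be sketched as follows. This is a minimal illustration of the stated rules, not the authors' implementation; the class and method names are hypothetical.

```python
TOTAL_CLOCK_S = 150.0  # total clock time per player for the whole match
TURN_LIMIT_S = 15.0    # incremental per-turn time limit


class MatchClock:
    """Tracks one player's remaining time under the paper's time control."""

    def __init__(self) -> None:
        self.remaining = TOTAL_CLOCK_S

    def record_turn(self, elapsed_s: float) -> bool:
        """Charge one turn's thinking time against the clock.

        Returns False if the player forfeits by exceeding either the
        per-turn limit or the total clock, True otherwise.
        """
        if elapsed_s > TURN_LIMIT_S:
            return False  # exceeded the 15 s per-turn limit
        self.remaining -= elapsed_s
        if self.remaining < 0:
            return False  # exhausted the 150 s total clock
        return True
```

For example, a player who spends 10 seconds on a turn keeps playing with 140 seconds remaining, while a 16-second turn forfeits immediately regardless of time banked.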