PokéChamp: an Expert-level Minimax Language Agent

Authors: Seth Karten, Andy Luu Nguyen, Chi Jin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Notably, our framework requires no additional LLM training.
Researcher Affiliation | Academia | Seth Karten, Andy Luu Nguyen, Chi Jin (Princeton University). Correspondence to: Seth Karten <EMAIL>.
Pseudocode | No | The paper describes the minimax tree search framework and illustrates it with Figure 4, but does not provide a formal pseudocode block or algorithm steps.
Open Source Code | Yes | Videos, code, and dataset are available online.
Open Datasets | Yes | This work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. This work establishes Pokémon as a benchmark to integrate LLM technologies with game-theoretic algorithms addressing general multi-agent problems. Videos, code, and dataset are available online.
Dataset Splits | No | The paper mentions compiling a dataset of over 3 million Pokémon battles and using replay data for action prediction, but it does not specify how this dataset was split into training, validation, or test sets for their experiments.
Hardware Specification | No | The paper mentions using GPT-4o and Llama 3.1:8b as language models, but it does not specify any hardware details such as GPU models, CPU models, or memory used for running the experiments.
Software Dependencies | Yes | The LLM agents utilize either Llama3.1:8b (Dubey et al., 2024) or GPT-4o-2024-05-13 (Achiam et al., 2023).
Experiment Setup | Yes | We evaluate PokéChamp in the popular Gen 9 OU format. Each experiment consists of at least 25 matches between any two methods, resulting in a minimum of 100 games per method for Elo calculations. The LLM agents utilize either Llama3.1:8b (Dubey et al., 2024) or GPT-4o-2024-05-13 (Achiam et al., 2023). Each player has a total clock time of 150 seconds for the entire match. In addition, each turn has an incremental time limit of 15 seconds. If a player exceeds either the total clock time or the incremental turn time, they automatically lose the match.
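The Pseudocode row notes that the paper describes its minimax tree search only in prose and a figure. As a rough illustration of what such a procedure looks like, here is a generic depth-limited minimax sketch; it is hypothetical and not the paper's implementation (PokéChamp uses LLM-driven action sampling and value estimation, which this sketch abstracts behind plain callables):

```python
def minimax(state, depth, maximizing, value_fn, actions_fn, step_fn):
    """Generic depth-limited minimax (illustrative only, not PokeChamp's code).

    value_fn(state)     -> heuristic value of a state
    actions_fn(state)   -> list of legal actions (empty list = terminal)
    step_fn(state, a)   -> successor state after action a
    Returns (best_value, best_action).
    """
    actions = actions_fn(state)
    if depth == 0 or not actions:
        return value_fn(state), None

    best_action = None
    if maximizing:
        best = float("-inf")
        for a in actions:
            v, _ = minimax(step_fn(state, a), depth - 1, False,
                           value_fn, actions_fn, step_fn)
            if v > best:
                best, best_action = v, a
    else:
        best = float("inf")
        for a in actions:
            v, _ = minimax(step_fn(state, a), depth - 1, True,
                           value_fn, actions_fn, step_fn)
            if v < best:
                best, best_action = v, a
    return best, best_action


# Toy example: state is an integer, actions add +1 or -1, value is the state.
# From 0 at depth 2, the maximizer's best guaranteed value is 0 (move +1,
# then the minimizer moves -1).
v, a = minimax(0, 2, True, lambda s: s, lambda s: [1, -1], lambda s, x: s + x)
```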
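The Experiment Setup row mentions Elo calculations over a minimum of 100 games per method, but the report does not record the update rule used. A minimal sketch of the standard Elo update, with an assumed K-factor of 32 (the paper does not specify one):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update after one game between players A and B.

    score_a: 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k=32 is an assumed K-factor, not taken from the paper.
    Returns the pair of updated ratings (new_a, new_b).
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


# Example: two 1500-rated players, A wins -> A gains 16 points, B loses 16.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
# new_a == 1516.0, new_b == 1484.0
```

Note that the update is zero-sum: the points A gains equal the points B loses, so the rating pool stays constant across any number of games.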