PokéChamp: an Expert-level Minimax Language Agent
Authors: Seth Karten, Andy Luu Nguyen, Chi Jin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Notably, our framework requires no additional LLM training. |
| Researcher Affiliation | Academia | Seth Karten * 1 Andy Luu Nguyen 1 Chi Jin 1 1Princeton University. Correspondence to: Seth Karten <EMAIL>. |
| Pseudocode | No | The paper describes the minimax tree search framework and illustrates it with Figure 4, but does not provide a formal pseudocode block or algorithm steps. |
| Open Source Code | Yes | Videos, code, and dataset are available online. |
| Open Datasets | Yes | This work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. This work establishes Pokémon as a benchmark to integrate LLM technologies with game-theoretic algorithms addressing general multi-agent problems. Videos, code, and dataset are available online. |
| Dataset Splits | No | The paper mentions compiling a dataset of over 3 million Pokémon battles and using replay data for action prediction, but it does not specify how this dataset was split into training, validation, or test sets for their experiments. |
| Hardware Specification | No | The paper mentions using GPT-4o and Llama 3.1:8b as language models, but it does not specify any hardware details like GPU models, CPU models, or memory used for running the experiments. |
| Software Dependencies | Yes | The LLM agents utilize either Llama3.1:8b (Dubey et al., 2024) or GPT-4o-2024-05-13 (Achiam et al., 2023). |
| Experiment Setup | Yes | We evaluate PokéChamp in the popular Gen 9 OU format. Each experiment consists of at least 25 matches between any two methods, resulting in a minimum of 100 games per method for Elo calculations. The LLM agents utilize either Llama3.1:8b (Dubey et al., 2024) or GPT-4o-2024-05-13 (Achiam et al., 2023). Each player has a total clock time of 150 seconds for the entire match. In addition, each turn has an incremental time limit of 15 seconds. If a player exceeds either the total clock time or the incremental turn time, they automatically lose the match. |
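The time-control rules in the experiment setup (a 150-second total clock per player, a 15-second per-turn limit, and automatic forfeit on exceeding either) can be sketched as follows. This is a minimal illustration of the stated rules, not the authors' implementation; the class and method names are hypothetical.

```python
TOTAL_CLOCK_S = 150.0  # total clock time per player for the whole match
TURN_LIMIT_S = 15.0    # incremental per-turn time limit


class MatchClock:
    """Tracks one player's remaining time under the paper's time control."""

    def __init__(self) -> None:
        self.remaining = TOTAL_CLOCK_S

    def record_turn(self, elapsed_s: float) -> bool:
        """Charge one turn's thinking time against the clock.

        Returns False if the player forfeits by exceeding either the
        per-turn limit or the total clock, True otherwise.
        """
        if elapsed_s > TURN_LIMIT_S:
            return False  # exceeded the 15 s per-turn limit
        self.remaining -= elapsed_s
        if self.remaining < 0:
            return False  # exhausted the 150 s total clock
        return True
```

For example, a player who spends 10 seconds on a turn keeps playing with 140 seconds remaining, while a 16-second turn forfeits immediately regardless of time banked.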