GameArena: Evaluating LLM Reasoning through Live Computer Games

Authors: Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We collect over 2000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs. Our user study with 100 participants suggests that GameArena improves user engagement compared to Chatbot Arena. For the first time, GameArena enables the collection of step-by-step LLM reasoning data in the wild."
Researcher Affiliation | Academia | Lanxiang Hu (1), Qiyu Li (1), Anze Xie (1), Nan Jiang (1), Ion Stoica (2), Haojian Jin (1), Hao Zhang (1); (1) University of California, San Diego; (2) University of California, Berkeley
Pseudocode | No | The paper describes the game formulation and evaluation metrics textually, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/lmgame-org."
Open Datasets | No | "We will release our gaming data for future research."
Dataset Splits | Yes | "Over a 10-week period from July 2024 to September 2024, we collect a total of 2240 game sessions using CloudResearch (Hartman et al., 2023) for evaluation. We then conduct retrospective data analysis introduced in Section 3 on the gaming data to obtain outcome metrics and procedure metrics for each model. ... Each optimizer is optimized from a subset of all our gaming data consisting of 200 game sessions. ... For each game, the two sets each contain 50 game sessions."
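The split reported above (a 200-session subset per optimizer, plus two 50-session sets per game) can be sketched as a simple random partition. This is a minimal illustration, not the paper's actual splitting code; the session IDs and seed are placeholders.

```python
import random

# Illustrative partition of the 2240 collected game sessions: a 200-session
# subset for system-prompt optimization and two disjoint 50-session sets.
# Session IDs and the seed are placeholders, not the paper's actual data.
sessions = [f"session-{i}" for i in range(2240)]
rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(sessions)

optimizer_subset = sessions[:200]   # used to tune system prompts
eval_set_a = sessions[200:250]      # first 50-session set
eval_set_b = sessions[250:300]      # second 50-session set
```

Shuffling before slicing keeps the three subsets disjoint while drawing each uniformly from the full pool of sessions.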
Hardware Specification | No | The paper discusses evaluating LLMs and collecting game sessions, but does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run the models or experiments.
Software Dependencies | No | "In GameArena, we develop a system prompt search and optimization pipeline using DSPy (Khattab et al., 2024)."
Experiment Setup | Yes | "To reduce potential prompt bias, we developed five optimized system prompts using DSPy (Khattab et al., 2024) and randomly selected one for each game session. We collected more than 2000 game sessions and analyzed the data to score each LLM based on capability-specific evaluation metrics. ... This pipeline uses the chain-of-thought module from DSPy and one of the five models acting as the evaluator. The evaluator guides the model to follow the game rules and judges response quality. The DSPy optimizer learns to bootstrap and identify effective system prompts. Each optimizer is optimized from a subset of all our gaming data consisting of 200 game sessions. Given a game and one of the five evaluators, the pipeline searches for an optimal system prompt that maximizes the evaluator's judgment on the model's generated questions or answers at each round, resulting in five highly optimized system prompts per game."
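The search loop the quote describes (candidate system prompts scored by an LLM evaluator, keeping the one that maximizes the evaluator's judgment) can be sketched without DSPy. This is a library-free illustration under stated assumptions: `model_respond` and `evaluator_score` are stand-ins for the playing LLM and the LLM judge, and the candidate prompts are invented, not the paper's optimized prompts.

```python
# Library-free sketch of the system-prompt search described above.
# The paper uses DSPy's chain-of-thought module and optimizer; here the
# model and evaluator are toy stand-in functions for illustration only.

def model_respond(system_prompt: str, session: str) -> str:
    # Stand-in for the LLM playing one game round under `system_prompt`.
    return f"{system_prompt}:{session}"

def evaluator_score(response: str) -> float:
    # Stand-in for the LLM judge checking rule-following and quality.
    # Toy heuristic: longer, more detailed responses score higher.
    return float(len(response))

def search_system_prompt(candidates: list[str], sessions: list[str]) -> str:
    """Return the candidate prompt with the highest mean evaluator score."""
    def mean_score(prompt: str) -> float:
        scores = [evaluator_score(model_respond(prompt, s)) for s in sessions]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)

candidates = ["Follow the rules.",
              "Follow the rules and think step by step."]
sessions = [f"session-{i}" for i in range(200)]  # 200-session subset
best_prompt = search_system_prompt(candidates, sessions)
```

In the paper this optimization is run once per (game, evaluator) pair, yielding five optimized system prompts per game; the sketch above corresponds to a single such run.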