GameArena: Evaluating LLM Reasoning through Live Computer Games

Authors: Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, Hao Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We collect over 2000 game sessions and provide detailed assessments of various reasoning capabilities for five state-of-the-art LLMs. Our user study with 100 participants suggests that GameArena improves user engagement compared to Chatbot Arena. For the first time, GameArena enables the collection of step-by-step LLM reasoning data in the wild."
Researcher Affiliation | Academia | Lanxiang Hu (1), Qiyu Li (1), Anze Xie (1), Nan Jiang (1), Ion Stoica (2), Haojian Jin (1), Hao Zhang (1); (1) University of California, San Diego; (2) University of California, Berkeley
Pseudocode | No | The paper describes the game formulation and evaluation metrics textually, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/lmgame-org."
Open Datasets | No | "We will release our gaming data for future research."
Dataset Splits | Yes | "Over a 10-week period from July 2024 to September 2024, we collect a total of 2240 game sessions using CloudResearch (Hartman et al., 2023) for evaluation. We then conduct retrospective data analysis introduced in Section 3 on the gaming data to obtain outcome metrics and procedure metrics for each model. ... Each optimizer is optimized from a subset of all our gaming data consisting of 200 game sessions. ... For each game, the two sets each contain 50 game sessions."
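The split reported above (a 200-session subset per optimizer, plus two 50-session sets per game) can be sketched as a simple random partition. This is a minimal illustration, not the paper's actual splitting code; the session IDs and seed are placeholders.

```python
import random

# Illustrative partition of the 2240 collected game sessions: a 200-session
# subset for system-prompt optimization and two disjoint 50-session sets.
# Session IDs and the seed are placeholders, not the paper's actual data.
sessions = [f"session-{i}" for i in range(2240)]
rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(sessions)

optimizer_subset = sessions[:200]   # used to tune system prompts
eval_set_a = sessions[200:250]      # first 50-session set
eval_set_b = sessions[250:300]      # second 50-session set
```

Shuffling before slicing keeps the three subsets disjoint while drawing each uniformly from the full pool of sessions.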
Hardware Specification | No | The paper discusses evaluating LLMs and collecting game sessions, but does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run the models or experiments.
Software Dependencies | No | "In GameArena, we develop a system prompt search and optimization pipeline using DSPy (Khattab et al., 2024)."
Experiment Setup | Yes | "To reduce potential prompt bias, we developed five optimized system prompts using DSPy (Khattab et al., 2024) and randomly selected one for each game session. We collected more than 2000 game sessions and analyzed the data to score each LLM based on capability-specific evaluation metrics. ... This pipeline uses the chain-of-thought module from DSPy and one of the five models acting as the evaluator. The evaluator guides the model to follow the game rules and judges response quality. The DSPy optimizer learns to bootstrap and identify effective system prompts. Each optimizer is optimized from a subset of all our gaming data consisting of 200 game sessions. Given a game and one of the five evaluators, the pipeline searches for an optimal system prompt that maximizes the evaluator's judgment on the model's generated questions or answers at each round, resulting in five highly optimized system prompts per game."
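The search loop the quote describes (candidate system prompts scored by an LLM evaluator, keeping the one that maximizes the evaluator's judgment) can be sketched without DSPy. This is a library-free illustration under stated assumptions: `model_respond` and `evaluator_score` are stand-ins for the playing LLM and the LLM judge, and the candidate prompts are invented, not the paper's optimized prompts.

```python
# Library-free sketch of the system-prompt search described above.
# The paper uses DSPy's chain-of-thought module and optimizer; here the
# model and evaluator are toy stand-in functions for illustration only.

def model_respond(system_prompt: str, session: str) -> str:
    # Stand-in for the LLM playing one game round under `system_prompt`.
    return f"{system_prompt}:{session}"

def evaluator_score(response: str) -> float:
    # Stand-in for the LLM judge checking rule-following and quality.
    # Toy heuristic: longer, more detailed responses score higher.
    return float(len(response))

def search_system_prompt(candidates: list[str], sessions: list[str]) -> str:
    """Return the candidate prompt with the highest mean evaluator score."""
    def mean_score(prompt: str) -> float:
        scores = [evaluator_score(model_respond(prompt, s)) for s in sessions]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)

candidates = ["Follow the rules.",
              "Follow the rules and think step by step."]
sessions = [f"session-{i}" for i in range(200)]  # 200-session subset
best_prompt = search_system_prompt(candidates, sessions)
```

In the paper this optimization is run once per (game, evaluator) pair, yielding five optimized system prompts per game; the sketch above corresponds to a single such run.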