Competing Large Language Models in Multi-Agent Gaming Environments
Authors: Jen-Tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael Lyu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate 13 LLMs from 6 model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring 69.8 out of 100, followed by LLaMA-3.1-70B (65.9) and Mixtral-8x22B (62.4). |
| Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong 2Tencent AI Lab 3Renmin University of China 4The Chinese University of Hong Kong, Shenzhen 5Tsinghua University |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. It describes game rules and prompt structures in text. |
| Open Source Code | Yes | Our code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench. |
| Open Datasets | No | The paper uses eight classical game theory scenarios and a dynamic scoring scheme to evaluate LLMs, allowing for dynamic game scene generation. It does not refer to a specific pre-existing, publicly available dataset with concrete access information (link, DOI, or citation to a dataset paper). |
| Dataset Splits | No | The paper evaluates LLMs in dynamic multi-agent gaming environments over multiple rounds, which does not involve traditional dataset splits for training, validation, and testing like in supervised learning tasks. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions various Large Language Models being evaluated (e.g., GPT-3.5, GPT-4, Gemini), but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) for the experimental setup. |
| Experiment Setup | Yes | Each game involves ten agents based on GPT-3.5, with the temperature parameter set to one. For simultaneous games, there are twenty rounds. We run each game five times to enhance the reliability of our findings and mitigate the impact of variance. This study conducts experiments across games employing a range of temperatures {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} under vanilla settings. Further exploration is conducted to ascertain whether instructional prompts, such as Chain-of-Thought (CoT) (Kojima et al., 2022), enhance the model's decision-making capabilities. Additionally, the model's capacity to generalize across diverse game settings is examined. |
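The reported setup (ten agents per game, twenty rounds for simultaneous games, five repetitions, and a six-value temperature sweep) can be enumerated as an evaluation grid. The sketch below is a minimal illustration of that grid; the names `GameRun`, `build_runs`, and the placeholder game name are assumptions for illustration, not the authors' actual harness from the GAMABench repository.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class GameRun:
    """One evaluation run: a game at a given temperature, repeated for reliability."""
    game: str
    temperature: float
    repetition: int
    n_agents: int = 10   # ten GPT-3.5-based agents per game
    n_rounds: int = 20   # twenty rounds for simultaneous games

# Temperature sweep and repetition count reported in the paper's setup
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
N_REPETITIONS = 5

def build_runs(games):
    """Enumerate every (game, temperature, repetition) combination."""
    return [
        GameRun(game=g, temperature=t, repetition=r)
        for g, t, r in product(games, TEMPERATURES, range(N_REPETITIONS))
    ]

runs = build_runs(["guessing_game"])  # hypothetical game name
print(len(runs))  # 1 game x 6 temperatures x 5 repetitions = 30
```

With the paper's eight games, this grid yields 8 x 6 x 5 = 240 runs per model under the temperature-sweep setting.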