Competing Large Language Models in Multi-Agent Gaming Environments
Authors: Jen-Tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael Lyu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate 13 LLMs from 6 model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring 69.8 out of 100, followed by LLaMA-3.1-70B (65.9) and Mixtral-8x22B (62.4). |
| Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong 2Tencent AI Lab 3Renmin University of China 4The Chinese University of Hong Kong, Shenzhen 5Tsinghua University |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. It describes game rules and prompt structures in text. |
| Open Source Code | Yes | Our code and experimental results are publicly available at https://github.com/CUHK-ARISE/GAMABench. |
| Open Datasets | No | The paper uses eight classical game theory scenarios and a dynamic scoring scheme to evaluate LLMs, allowing for dynamic game scene generation. It does not refer to a specific pre-existing, publicly available dataset with concrete access information (link, DOI, or citation to a dataset paper). |
| Dataset Splits | No | The paper evaluates LLMs in dynamic multi-agent gaming environments over multiple rounds, which does not involve traditional dataset splits for training, validation, and testing like in supervised learning tasks. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions various Large Language Models being evaluated (e.g., GPT-3.5, GPT-4, Gemini), but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) for the experimental setup. |
| Experiment Setup | Yes | Each game involves ten agents based on GPT-3.5, with the temperature parameter set to one. For simultaneous games, there are twenty rounds. We run each game five times to enhance the reliability of our findings and mitigate the impact of variance. This study conducts experiments across games employing a range of temperatures {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} under vanilla settings. Further exploration is conducted to ascertain whether instructional prompts, such as Chain-of-Thought (CoT) (Kojima et al., 2022), enhance the model's decision-making capabilities. Additionally, the model's capacity to generalize across diverse game settings is examined. |
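The reported setup (ten agents per game, twenty rounds for simultaneous games, five repetitions, and a six-value temperature sweep) can be enumerated as an evaluation grid. The sketch below is a minimal illustration of that grid; the names `GameRun`, `build_runs`, and the placeholder game name are assumptions for illustration, not the authors' actual harness from the GAMABench repository.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class GameRun:
    """One evaluation run: a game at a given temperature, repeated for reliability."""
    game: str
    temperature: float
    repetition: int
    n_agents: int = 10   # ten GPT-3.5-based agents per game
    n_rounds: int = 20   # twenty rounds for simultaneous games

# Temperature sweep and repetition count reported in the paper's setup
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
N_REPETITIONS = 5

def build_runs(games):
    """Enumerate every (game, temperature, repetition) combination."""
    return [
        GameRun(game=g, temperature=t, repetition=r)
        for g, t, r in product(games, TEMPERATURES, range(N_REPETITIONS))
    ]

runs = build_runs(["guessing_game"])  # hypothetical game name
print(len(runs))  # 1 game x 6 temperatures x 5 repetitions = 30
```

With the paper's eight games, this grid yields 8 x 6 x 5 = 240 runs per model under the temperature-sweep setting.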