PokerBench: Training Large Language Models to Become Professional Poker Players
Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after finetuning, these models show marked improvements. We validate POKERBENCH by having models with different scores compete with each other, demonstrating that higher scores on POKERBENCH lead to higher win rates in actual poker games. |
| Researcher Affiliation | Academia | 1University of California Berkeley, 110 Sproul Hall, Berkeley, CA 94720 USA 2Georgia Institute of Technology, 225 North Avenue NW, Atlanta, GA 30332 USA, {brian.li}@gatech.edu |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and code can be found at https://github.com/pokerllm/pokerbench |
| Open Datasets | Yes | The POKERBENCH benchmark consists of 11,000 most important scenarios... The dataset and code can be found at https://github.com/pokerllm/pokerbench |
| Dataset Splits | Yes | The POKERBENCH benchmark consists of 1k evaluation spots for pre-flop and 10k evaluation spots for post-flop play. Along with the evaluation benchmark, we also release a training set containing 60k pre-flop spots and 500k post-flop spots. ... for both preflop and post-flop settings, we choose a balanced sampling strategy. Precisely, there is an equal percentage of samples each with the correct decision labels being fold, call, check, or bet/raise. |
| Hardware Specification | No | The paper mentions using 'Open AI API' and 'Together AI API' for evaluation and fine-tuning models, but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instance configurations) used for their experiments. |
| Software Dependencies | No | The paper mentions using models like GPT-4, Llama-3, Llama-2, and Gemma-2B, and APIs from OpenAI and Together AI, but it does not provide specific version numbers for any software libraries, frameworks, or environments used in their implementation or fine-tuning process. |
| Experiment Setup | Yes | We fine-tune the model for 5000 optimization steps with a batch size of 128 and a learning rate of 1e-6... For generating text, we set temperature = 0.1 and top-p = 0.95 to generate the most probable answer to get statistically stable results. |
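The balanced sampling strategy quoted in the Dataset Splits row (an equal percentage of samples for each correct decision label: fold, call, check, or bet/raise) can be sketched as a simple downsampling step. This is a hypothetical reconstruction, not the authors' code; the `spots` record format and the `balanced_sample` helper are assumptions for illustration.

```python
import random
from collections import defaultdict

def balanced_sample(spots, labels=("fold", "call", "check", "raise"), seed=0):
    """Downsample so each correct-decision label appears equally often.

    `spots` is assumed to be a list of dicts with an "action" key holding
    the ground-truth decision label (a format invented for this sketch).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for spot in spots:
        if spot["action"] in labels:
            by_label[spot["action"]].append(spot)
    # Cap every class at the size of the rarest class.
    n = min(len(v) for v in by_label.values())
    sample = []
    for label in labels:
        sample.extend(rng.sample(by_label[label], n))
    rng.shuffle(sample)
    return sample
```

For example, given 10 fold spots, 5 call spots, 7 check spots, and 3 raise spots, the helper returns 3 of each, so each label makes up exactly 25% of the resulting set.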
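The decoding settings quoted in the Experiment Setup row (temperature = 0.1, top-p = 0.95) combine temperature scaling with nucleus (top-p) sampling; a low temperature concentrates probability mass on the most likely token, which is why the authors describe the output as statistically stable. A minimal from-scratch sketch of that decoding rule, assuming raw logits over a small vocabulary (the `top_p_sample` function and its inputs are illustrative, not the paper's implementation):

```python
import math
import random

def top_p_sample(logits, temperature=0.1, top_p=0.95, seed=0):
    """Sample a token index via temperature-scaled nucleus sampling."""
    rng = random.Random(seed)
    # Softmax with temperature scaling (low temperature sharpens the distribution).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With temperature 0.1, even a modest logit gap (e.g. `[5.0, 1.0, 0.0]`) becomes near-deterministic after scaling, so the nucleus collapses to the single most probable token and the same answer is returned on every call.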