PokerBench: Training Large Language Models to Become Professional Poker Players

Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate POKERBENCH by having models with different scores compete with each other, demonstrating that higher scores on POKERBENCH lead to higher win rates in actual poker games.
Researcher Affiliation | Academia | ¹University of California, Berkeley, 110 Sproul Hall, Berkeley, CA 94720 USA; ²Georgia Institute of Technology, 225 North Avenue NW, Atlanta, GA 30332 USA; {brian.li}@gatech.edu
Pseudocode | No | The paper describes its methods and procedures in prose, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset and code can be found at https://github.com/pokerllm/pokerbench
Open Datasets | Yes | The POKERBENCH benchmark consists of 11,000 most important scenarios... The dataset and code can be found at https://github.com/pokerllm/pokerbench
Dataset Splits | Yes | The POKERBENCH benchmark consists of 1k evaluation spots for pre-flop and 10k evaluation spots for post-flop play. Along with the evaluation benchmark, we also release a training set containing 60k pre-flop spots and 500k post-flop spots. ... For both pre-flop and post-flop settings, we choose a balanced sampling strategy. Precisely, there is an equal percentage of samples with each of the correct decision labels: fold, call, check, or bet/raise.
Hardware Specification | No | The paper mentions using the OpenAI API and Together AI API for evaluating and fine-tuning models, but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instance configurations) used for the experiments.
Software Dependencies | No | The paper mentions models such as GPT-4, Llama-3, Llama-2, and Gemma-2B, and APIs from OpenAI and Together AI, but it does not provide specific version numbers for any software libraries, frameworks, or environments used in the implementation or fine-tuning process.
Experiment Setup | Yes | We fine-tune the model for 5000 optimization steps with a batch size of 128 and a learning rate of 1e-6... For generating text, we set temperature = 0.1 and top-p = 0.95 to generate the most probable answer and obtain statistically stable results.
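The balanced sampling strategy quoted in the Dataset Splits row (an equal share of spots for each correct decision label: fold, call, check, bet/raise) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's released code; the function name, the `"label"` field, and the rejection behavior when a label class is too small are all assumptions.

```python
import random
from collections import defaultdict

# The four decision labels named in the paper's sampling description.
ACTIONS = ["fold", "call", "check", "bet/raise"]

def balanced_sample(spots, n_total, seed=0):
    """Draw n_total spots with an equal count per correct-action label.

    `spots` is a list of dicts, each carrying a "label" key whose value is
    one of ACTIONS. Raises ValueError if any label class is too small.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for spot in spots:
        by_label[spot["label"]].append(spot)

    per_label = n_total // len(ACTIONS)
    sample = []
    for action in ACTIONS:
        pool = by_label[action]
        if len(pool) < per_label:
            raise ValueError(f"not enough '{action}' spots to balance")
        sample.extend(rng.sample(pool, per_label))

    rng.shuffle(sample)  # avoid label-ordered output
    return sample
```

The equal-per-label draw matters for evaluation: an unbalanced pool would let a model score well by always predicting the majority action, whereas a balanced benchmark forces it to distinguish all four decisions.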