PokerBench: Training Large Language Models to Become Professional Poker Players
Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after finetuning, these models show marked improvements. We validate POKERBENCH by having models with different scores compete with each other, demonstrating that higher scores on POKERBENCH lead to higher win rates in actual poker games. |
| Researcher Affiliation | Academia | 1University of California Berkeley, 110 Sproul Hall, Berkeley, CA 94720 USA 2Georgia Institute of Technology, 225 North Avenue NW, Atlanta, GA 30332 USA, {brian.li}@gatech.edu |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and code can be found at https://github.com/pokerllm/pokerbench |
| Open Datasets | Yes | The POKERBENCH benchmark consists of 11,000 most important scenarios... The dataset and code can be found at https://github.com/pokerllm/pokerbench |
| Dataset Splits | Yes | The POKERBENCH benchmark consists of 1k evaluation spots for pre-flop and 10k evaluation spots for post-flop play. Along with the evaluation benchmark, we also release a training set containing 60k pre-flop spots and 500k post-flop spots. ... for both preflop and post-flop settings, we choose a balanced sampling strategy. Precisely, there is an equal percentage of samples each with the correct decision labels being fold, call, check, or bet/raise. |
| Hardware Specification | No | The paper mentions using 'Open AI API' and 'Together AI API' for evaluation and fine-tuning models, but it does not specify any particular hardware (e.g., GPU models, CPU types, or cloud instance configurations) used for their experiments. |
| Software Dependencies | No | The paper mentions using models like GPT-4, Llama-3, Llama-2, and Gemma-2B, and APIs from OpenAI and Together AI, but it does not provide specific version numbers for any software libraries, frameworks, or environments used in their implementation or fine-tuning process. |
| Experiment Setup | Yes | We fine-tune the model for 5000 optimization steps with a batch size of 128 and a learning rate of 1e-6... For generating text, we set temperature = 0.1 and top-p = 0.95 to generate the most probable answer to get statistically stable results. |
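The balanced sampling strategy quoted in the Dataset Splits row (an equal percentage of samples for each correct decision label: fold, call, check, or bet/raise) can be sketched as a simple downsampling step. This is a hypothetical reconstruction, not the authors' code; the `spots` record format and the `balanced_sample` helper are assumptions for illustration.

```python
import random
from collections import defaultdict

def balanced_sample(spots, labels=("fold", "call", "check", "raise"), seed=0):
    """Downsample so each correct-decision label appears equally often.

    `spots` is assumed to be a list of dicts with an "action" key holding
    the ground-truth decision label (a format invented for this sketch).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for spot in spots:
        if spot["action"] in labels:
            by_label[spot["action"]].append(spot)
    # Cap every class at the size of the rarest class.
    n = min(len(v) for v in by_label.values())
    sample = []
    for label in labels:
        sample.extend(rng.sample(by_label[label], n))
    rng.shuffle(sample)
    return sample
```

For example, given 10 fold spots, 5 call spots, 7 check spots, and 3 raise spots, the helper returns 3 of each, so each label makes up exactly 25% of the resulting set.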
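The decoding settings quoted in the Experiment Setup row (temperature = 0.1, top-p = 0.95) combine temperature scaling with nucleus (top-p) sampling; a low temperature concentrates probability mass on the most likely token, which is why the authors describe the output as statistically stable. A minimal from-scratch sketch of that decoding rule, assuming raw logits over a small vocabulary (the `top_p_sample` function and its inputs are illustrative, not the paper's implementation):

```python
import math
import random

def top_p_sample(logits, temperature=0.1, top_p=0.95, seed=0):
    """Sample a token index via temperature-scaled nucleus sampling."""
    rng = random.Random(seed)
    # Softmax with temperature scaling (low temperature sharpens the distribution).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With temperature 0.1, even a modest logit gap (e.g. `[5.0, 1.0, 0.0]`) becomes near-deterministic after scaling, so the nucleus collapses to the single most probable token and the same answer is returned on every call.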