Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search

Authors: Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

ICLR 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including the Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents equipped with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill-acquisition techniques, and pre-existing LLM agents across both game environments, and achieve comparable performance against human players.
Researcher Affiliation | Collaboration | Jonathan Light¹, Min Cai², Weiqin Chen¹, Guanzhi Wang⁵, Xiusi Chen³, Wei Cheng⁴, Yisong Yue⁵, Ziniu Hu⁵. ¹Rensselaer Polytechnic Institute, ²Shenzhen University, ³University of California, Los Angeles, ⁴NEC Laboratories America, ⁵California Institute of Technology
Pseudocode | Yes | Algorithm 1: STRATEGIST Pseudocode
Open Source Code | No | The paper provides a project website link (https://llm-strategist.github.io) but does not explicitly state that the source code for the described methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | No | STRATEGIST is a generalizable framework to optimize the strategy through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in two challenging games: the Game of Pure Strategy (GOPS) and The Resistance: Avalon (Ross, 1971; Light et al., 2023).
Dataset Splits | No | The paper relies on self-play simulations within game environments (GOPS and Avalon) and explicitly states it does not require training data, so conventional dataset splits are not applicable.
Hardware Specification | Yes | All experiments in this work were performed on a workstation with an NVIDIA GeForce RTX 3070 GPU and an Intel Core i9-10900 CPU at 2.80 GHz, and on a MacBook Pro.
Software Dependencies | No | The paper mentions using GPT-3.5 and GPT-4 as LLMs and discusses neural-network architectures, but does not provide specific version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other key software dependencies.
Experiment Setup | Yes | We employ a Monte-Carlo-based RL approach (Sutton & Barto, 2018) to train a value heuristic for both five-player Avalon and five-card GOPS games. To do so, we construct an MSE loss in each episode for training the value function... For Avalon, we consider 20 evolutions (epochs) for the training process. At the end of each evolution, 30 batch runs (episodes) are generated and used to train the value-function network, i.e., a total of 600 episodes for training. In GOPS, we likewise train for 20 evolutions, with 60 batch runs each (1,200 episodes in total)... The neural network is a multilayer perceptron (MLP) with 2 hidden layers. We select hidden-layer sizes of 128 and 128 for Avalon and 64 and 64 for GOPS. The chosen learning rates are 5e-4 and 8e-4, respectively.
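The training loop described above (a 2-hidden-layer MLP value heuristic fit with an MSE loss on Monte-Carlo returns) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the state featurization, the returns, and the data sizes are hypothetical stand-ins, while the 64/64 hidden sizes and the 8e-4 learning rate follow the GOPS configuration quoted in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=64, out_dim=1):
    """Two-hidden-layer MLP, matching the paper's 64/64 GOPS configuration."""
    def layer(n_in, n_out):
        return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)
    return [layer(in_dim, hidden), layer(hidden, hidden), layer(hidden, out_dim)]

def forward(params, x):
    """Return the value prediction and per-layer activations (for backprop)."""
    acts = [x]
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
        acts.append(x)
    return x, acts

def mse_step(params, x, y, lr=8e-4):
    """One gradient step on the episode-level MSE loss (manual backprop)."""
    pred, acts = forward(params, x)
    grad = 2.0 * (pred - y) / len(y)  # dL/dpred for mean squared error
    for i in reversed(range(len(params))):
        W, b = params[i]
        gW = acts[i].T @ grad          # acts[i] is the input to layer i
        gb = grad.sum(axis=0)
        grad = grad @ W.T              # propagate to the previous layer
        if i > 0:
            grad *= (acts[i] > 0)      # ReLU backward through hidden output
        params[i] = (W - lr * gW, b - lr * gb)
    return float(np.mean((pred - y) ** 2))

# Hypothetical stand-in for one evolution's worth of self-play episodes:
# 60 "episodes" of 10-dim state features with Monte-Carlo return targets.
X = rng.normal(size=(60, 10))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

params = init_mlp(in_dim=10)
losses = [mse_step(params, X, y) for _ in range(200)]
```

In the paper's setup this inner fit would be repeated across 20 evolutions, with fresh self-play episodes generated at the end of each one.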