Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search

Authors: Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu

ICLR 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including the Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents equipped with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill-acquisition techniques, and pre-existing LLM agents across both game environments, and achieve comparable performance against human players.
Researcher Affiliation | Collaboration | Jonathan Light¹, Min Cai², Weiqin Chen¹, Guanzhi Wang⁵, Xiusi Chen³, Wei Cheng⁴, Yisong Yue⁵, Ziniu Hu⁵. ¹Rensselaer Polytechnic Institute, ²Shenzhen University, ³University of California, Los Angeles, ⁴NEC Laboratories America, ⁵California Institute of Technology
Pseudocode | Yes | Algorithm 1: STRATEGIST Pseudocode
Open Source Code | No | The paper provides a project website link (https://llm-strategist.github.io) but does not explicitly state that the source code for the described methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | No | STRATEGIST is a generalizable framework to optimize the strategy through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in two challenging games: the Game of Pure Strategy (GOPS) and The Resistance: Avalon (Ross, 1971; Light et al., 2023).
Dataset Splits | No | The paper relies on self-play simulations within game environments (GOPS and Avalon) and explicitly states it does not require training data, so conventional dataset splits are not applicable.
Hardware Specification | Yes | All experiments in this work were performed on a workstation with an NVIDIA GeForce RTX 3070 GPU and an Intel Core i9-10900 CPU at 2.80 GHz, and on a MacBook Pro.
Software Dependencies | No | The paper mentions using GPT-3.5 and GPT-4 as LLMs and discusses neural-network architectures, but does not provide specific version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other key software dependencies.
Experiment Setup | Yes | We employ a Monte-Carlo-based RL approach (Sutton & Barto, 2018) to train a value heuristic for both five-player Avalon and five-card GOPS games. To do so, we construct an MSE loss in each episode for training the value function... For Avalon, we consider 20 evolutions (epochs) for the training process. At the end of each evolution, 30 batch runs (episodes) are generated and used to train the value-function network, i.e., a total of 600 episodes for training. In GOPS, we likewise train for 20 evolutions, with 60 batch runs each (1,200 episodes in total)... The neural network is a multilayer perceptron (MLP) with 2 hidden layers. We select hidden-layer sizes of 128 and 128 for Avalon and 64 and 64 for GOPS. The chosen learning rates are 5e-4 and 8e-4, respectively.
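The training loop described above (a 2-hidden-layer MLP value heuristic fit with an MSE loss on Monte-Carlo returns) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the state featurization, the returns, and the data sizes are hypothetical stand-ins, while the 64/64 hidden sizes and the 8e-4 learning rate follow the GOPS configuration quoted in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=64, out_dim=1):
    """Two-hidden-layer MLP, matching the paper's 64/64 GOPS configuration."""
    def layer(n_in, n_out):
        return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)
    return [layer(in_dim, hidden), layer(hidden, hidden), layer(hidden, out_dim)]

def forward(params, x):
    """Return the value prediction and per-layer activations (for backprop)."""
    acts = [x]
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
        acts.append(x)
    return x, acts

def mse_step(params, x, y, lr=8e-4):
    """One gradient step on the episode-level MSE loss (manual backprop)."""
    pred, acts = forward(params, x)
    grad = 2.0 * (pred - y) / len(y)  # dL/dpred for mean squared error
    for i in reversed(range(len(params))):
        W, b = params[i]
        gW = acts[i].T @ grad          # acts[i] is the input to layer i
        gb = grad.sum(axis=0)
        grad = grad @ W.T              # propagate to the previous layer
        if i > 0:
            grad *= (acts[i] > 0)      # ReLU backward through hidden output
        params[i] = (W - lr * gW, b - lr * gb)
    return float(np.mean((pred - y) ** 2))

# Hypothetical stand-in for one evolution's worth of self-play episodes:
# 60 "episodes" of 10-dim state features with Monte-Carlo return targets.
X = rng.normal(size=(60, 10))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

params = init_mlp(in_dim=10)
losses = [mse_step(params, X, y) for _ in range(200)]
```

In the paper's setup this inner fit would be repeated across 20 evolutions, with fresh self-play episodes generated at the end of each one.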