Monte Carlo Planning with Large Language Model for Text-Based Game Agents

Authors: Zijing Shi, Meng Fang, Ling Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. Additionally, we perform ablation studies to highlight the role of the memory mechanism in LLM policy.
Researcher Affiliation | Academia | Zijing Shi (AAII, University of Technology Sydney), Meng Fang (University of Liverpool), Ling Chen (AAII, University of Technology Sydney)
Pseudocode | Yes | Algorithm 1: Monte Carlo Planning with Dynamic Memory-Guided LLM (MC-DML)
Open Source Code | Yes | Our code is available at https://textgamer.github.io/mc-dml/.
Open Datasets | Yes | We conduct experiments using a series of text-based games from the Jericho benchmark (Hausknecht et al., 2020).
Dataset Splits | No | The experiments consist of an agent playing full games from the Jericho benchmark, not supervised learning over a static dataset, so train/validation/test splits are not applicable to this setup and none are reported.
Hardware Specification | No | The paper mentions using "gpt-3.5-turbo-0125 as the backend model" for the LLM policy, which implies an API service. However, it does not specify any hardware (e.g., GPU or CPU models, memory) used to run the MCTS algorithm or the other components of the implementation.
Software Dependencies | Yes | For the LLM policy, we use gpt-3.5-turbo-0125 as the backend model with a sampling temperature set to 0.
Experiment Setup | Yes | We set the discount factor to 0.95 and the number of simulations to 50 × len(A). We set C_puct to 50; specifically, it is set to 20 for the games Deephome and Library, and to 200 for the game Detective. The LLM policy uses gpt-3.5-turbo-0125 as the backend model with a sampling temperature set to 0. We query the LLM for the index of the optimal action and retrieve the log probabilities for the top 20 tokens at that index. For absent actions, we assign a log probability of -10. These log probabilities are then normalized using softmax with a temperature of 5. The in-trial memory is set to (o_{t-1}, a_{t-1}, o_t). The size of the cross-trial memory K is set to 3. We set d_min to 10, d_max to 30, and the step increment d to 20.
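To make the reported hyperparameters concrete, the following is a minimal sketch of the two numeric steps quoted above: softmax-normalizing the LLM's top-token log probabilities into an action prior (with log probability -10 for absent actions and a softmax temperature of 5), and a PUCT-style selection score weighted by C_puct. The PUCT formula here is the standard MCTS-with-prior form and is an assumption, not quoted from the paper; the function names and example actions are illustrative.

```python
import math

def action_priors(llm_logprobs, actions, temperature=5.0, absent_logprob=-10.0):
    """Softmax-normalize LLM log probabilities into a prior over valid actions.

    Actions missing from the LLM's top-20 tokens receive a log probability
    of -10, matching the setup reported in the paper.
    """
    scaled = [llm_logprobs.get(a, absent_logprob) / temperature for a in actions]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return {a: e / z for a, e in zip(actions, exps)}

def puct_score(q_value, prior, parent_visits, visits, c_puct=50.0):
    """Standard PUCT score: exploitation term plus prior-weighted exploration.

    c_puct=50 mirrors the default reported in the paper (20 for Deephome and
    Library, 200 for Detective).
    """
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

# Illustrative usage with hypothetical log probabilities for two of three
# valid actions; "take lamp" falls back to the absent-action log prob of -10.
priors = action_priors({"go north": -0.1, "open door": -2.0},
                       ["go north", "open door", "take lamp"])
score = puct_score(q_value=0.0, prior=priors["go north"],
                   parent_visits=100, visits=0)
```

Under this reading, the high softmax temperature (5) flattens the LLM's prior so that low-probability but valid actions still receive meaningful exploration weight during tree search.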