Monte Carlo Planning with Large Language Model for Text-Based Game Agents

Authors: Zijing Shi, Meng Fang, Ling Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. Additionally, we perform ablation studies to highlight the role of the memory mechanism in LLM policy.
Researcher Affiliation | Academia | Zijing Shi (AAII, University of Technology Sydney), Meng Fang (University of Liverpool), Ling Chen (AAII, University of Technology Sydney)
Pseudocode | Yes | Algorithm 1: Monte Carlo Planning with Dynamic Memory-Guided LLM (MC-DML)
Open Source Code | Yes | Our code is available at https://textgamer.github.io/mc-dml/.
Open Datasets | Yes | We conduct experiments using a series of text-based games from the Jericho benchmark (Hausknecht et al., 2020).
Dataset Splits | No | The experiments consist of an agent playing full games from the Jericho benchmark, not supervised learning over a static dataset, so train/validation/test splits are not applicable to this setup and none are reported.
Hardware Specification | No | The paper mentions using "gpt-3.5-turbo-0125 as the backend model" for the LLM policy, which implies an API service. However, it does not specify any hardware (e.g., GPU or CPU models, memory) used to run the MCTS algorithm or the other components of the implementation.
Software Dependencies | Yes | For the LLM policy, we use gpt-3.5-turbo-0125 as the backend model with a sampling temperature set to 0.
Experiment Setup | Yes | We set the discount factor to 0.95 and the number of simulations to 50 × len(A). We set C_puct to 50; specifically, it is set to 20 for the games Deephome and Library, and to 200 for the game Detective. The LLM policy uses gpt-3.5-turbo-0125 as the backend model with a sampling temperature set to 0. We query the LLM for the index of the optimal action and retrieve the log probabilities for the top 20 tokens at that index. For absent actions, we assign a log probability of -10. These log probabilities are then normalized using softmax with a temperature of 5. The in-trial memory is set to (o_{t-1}, a_{t-1}, o_t). The size of the cross-trial memory K is set to 3. We set d_min to 10, d_max to 30, and the step increment d to 20.
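To make the reported hyperparameters concrete, the following is a minimal sketch of the two numeric steps quoted above: softmax-normalizing the LLM's top-token log probabilities into an action prior (with log probability -10 for absent actions and a softmax temperature of 5), and a PUCT-style selection score weighted by C_puct. The PUCT formula here is the standard MCTS-with-prior form and is an assumption, not quoted from the paper; the function names and example actions are illustrative.

```python
import math

def action_priors(llm_logprobs, actions, temperature=5.0, absent_logprob=-10.0):
    """Softmax-normalize LLM log probabilities into a prior over valid actions.

    Actions missing from the LLM's top-20 tokens receive a log probability
    of -10, matching the setup reported in the paper.
    """
    scaled = [llm_logprobs.get(a, absent_logprob) / temperature for a in actions]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return {a: e / z for a, e in zip(actions, exps)}

def puct_score(q_value, prior, parent_visits, visits, c_puct=50.0):
    """Standard PUCT score: exploitation term plus prior-weighted exploration.

    c_puct=50 mirrors the default reported in the paper (20 for Deephome and
    Library, 200 for Detective).
    """
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

# Illustrative usage with hypothetical log probabilities for two of three
# valid actions; "take lamp" falls back to the absent-action log prob of -10.
priors = action_priors({"go north": -0.1, "open door": -2.0},
                       ["go north", "open door", "take lamp"])
score = puct_score(q_value=0.0, prior=priors["go north"],
                   parent_visits=100, visits=0)
```

Under this reading, the high softmax temperature (5) flattens the LLM's prior so that low-probability but valid actions still receive meaningful exploration weight during tree search.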