Mental Modelling of Reinforcement Learning Agents by Language Models
Authors: Wenhao Lu, Xufeng Zhao, Josua Spisak, Jae Hee Lee, Stefan Wermter
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper empirically examines, for the first time, how well large language models (LLMs) can build a mental model of reinforcement learning (RL) agents, termed agent mental modelling, by reasoning about an agent's behaviour and its effect on states from agent interaction history. This research attempts to unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in explainable RL. To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. |
| Researcher Affiliation | Academia | Wenhao Lu EMAIL University of Hamburg Xufeng Zhao EMAIL University of Hamburg Josua Spisak EMAIL University of Hamburg Jae Hee Lee EMAIL University of Hamburg Stefan Wermter EMAIL University of Hamburg |
| Pseudocode | Yes | Algorithm 1 presents an example pseudo-code for the next action prediction tasks. |
| Open Source Code | No | The paper does not explicitly state that source code for their methodology is being released or provide a link to a code repository. |
| Open Datasets | No | The paper describes creating its own dataset of interaction histories: "The dataset of interaction histories (episodes) is collected by running RL agents in each task." While it uses known RL environments (like Mountain Car, Acrobot), the specific interaction data they collected is not stated to be publicly available, nor is a link or citation for its direct access provided. |
| Dataset Splits | No | The paper mentions using a dataset of "approximately 2000 query samples" and an "offline RL dataset ET for a task T" but does not specify how this data is split into training, validation, or test sets for reproduction of their LLM evaluation setup. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments, such as GPU/CPU models or specific computing resources. |
| Software Dependencies | No | The paper mentions using specific LLM models (Llama3-8B, Llama3-70B, GPT-3.5, and GPT-4o) and the MuJoCo physics engine, but it does not specify version numbers for any software libraries or dependencies used to implement their evaluation framework. |
| Experiment Setup | Yes | All language models are prompted with the Chain-of-Thought (CoT) strategy (Wei et al., 2022b), explicitly encouraged to provide reasoning with explanations before jumping to the answer. The in-context learning prompts we constructed consist of task-specific background information, agent behaviour history, and evaluation question prompts (see Appendix B for example instantiated prompts). For tasks with discrete action spaces, LLMs are prompted to output a single integer within the action range. For tasks with continuous actions, we evaluate two options: (1) predicting which bin (from a manually divided set of 10) the next action will fall into, and (2) directly predicting the absolute action value within the valid range for each action dimension. For continuous state prediction, we adopt predicting relative changes (e.g., increase, decrease, unchanged) instead of exact state values. |
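The binned evaluation of continuous actions described above could be reproduced with a helper like the following. This is a minimal sketch, not code from the paper: the function name `action_to_bin` and the assumption of equal-width bins over the valid action range are illustrative choices.

```python
import numpy as np

def action_to_bin(action: float, low: float, high: float, n_bins: int = 10) -> int:
    """Map a continuous action value to one of n_bins equal-width bins
    spanning [low, high], as in the paper's option (1) for continuous
    action prediction (bin layout assumed, not specified in the paper)."""
    action = np.clip(action, low, high)
    # Interior bin edges; np.digitize returns the index of the bin
    # the value falls into (0 .. n_bins - 1 after clipping).
    edges = np.linspace(low, high, n_bins + 1)[1:-1]
    return int(np.digitize(action, edges))

# Example: a task with action range [-1, 1] (e.g., continuous Mountain Car)
print(action_to_bin(-1.0, -1.0, 1.0))  # lowest bin: 0
print(action_to_bin(0.05, -1.0, 1.0))  # middle bin: 5
print(action_to_bin(1.0, -1.0, 1.0))   # highest bin: 9
```

An LLM's predicted bin index can then be compared directly against `action_to_bin(ground_truth_action, low, high)` to score the prediction.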