Mental Modelling of Reinforcement Learning Agents by Language Models
Authors: Wenhao Lu, Xufeng Zhao, Josua Spisak, Jae Hee Lee, Stefan Wermter
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper empirically examines, for the first time, how well large language models (LLMs) can build a mental model of reinforcement learning (RL) agents, termed agent mental modelling, by reasoning about an agent's behaviour and its effect on states from agent interaction history. This research attempts to unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in explainable RL. To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. |
| Researcher Affiliation | Academia | Wenhao Lu EMAIL University of Hamburg Xufeng Zhao EMAIL University of Hamburg Josua Spisak EMAIL University of Hamburg Jae Hee Lee EMAIL University of Hamburg Stefan Wermter EMAIL University of Hamburg |
| Pseudocode | Yes | Algorithm 1 presents an example pseudo-code for the next action prediction tasks. |
| Open Source Code | No | The paper does not explicitly state that source code for their methodology is being released or provide a link to a code repository. |
| Open Datasets | No | The paper describes creating its own dataset of interaction histories: "The dataset of interaction histories (episodes) is collected by running RL agents in each task." While it uses known RL environments (like Mountain Car, Acrobot), the specific interaction data they collected is not stated to be publicly available, nor is a link or citation for its direct access provided. |
| Dataset Splits | No | The paper mentions using a dataset of "approximately 2000 query samples" and an "offline RL dataset ET for a task T" but does not specify how this data is split into training, validation, or test sets for reproduction of their LLM evaluation setup. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments, such as GPU/CPU models or specific computing resources. |
| Software Dependencies | No | The paper mentions using specific LLM models (Llama3-8B, Llama3-70B, GPT-3.5, and GPT-4o) and the MuJoCo physics engine, but it does not specify version numbers for any software libraries or dependencies used to implement their evaluation framework. |
| Experiment Setup | Yes | All language models are prompted with the Chain-of-Thought (CoT) strategy (Wei et al., 2022b), explicitly encouraged to provide reasoning with explanations before jumping to the answer. The in-context learning prompts we constructed consist of task-specific background information, agent behaviour history, and evaluation question prompts (see Appendix B for example instantiated prompts). For tasks with discrete action spaces, LLMs are prompted to output a single integer within the action range. For tasks with continuous actions, we evaluate two options: (1) predicting which bin (from a manually divided set of 10) the next action will fall into, and (2) directly predicting the absolute action value within the valid range for each action dimension. For continuous state prediction, we adopt predicting relative changes (e.g., increase, decrease, unchanged) instead of exact state values. |
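The binned evaluation of continuous actions described above could be reproduced with a helper like the following. This is a minimal sketch, not code from the paper: the function name `action_to_bin` and the assumption of equal-width bins over the valid action range are illustrative choices.

```python
import numpy as np

def action_to_bin(action: float, low: float, high: float, n_bins: int = 10) -> int:
    """Map a continuous action value to one of n_bins equal-width bins
    spanning [low, high], as in the paper's option (1) for continuous
    action prediction (bin layout assumed, not specified in the paper)."""
    action = np.clip(action, low, high)
    # Interior bin edges; np.digitize returns the index of the bin
    # the value falls into (0 .. n_bins - 1 after clipping).
    edges = np.linspace(low, high, n_bins + 1)[1:-1]
    return int(np.digitize(action, edges))

# Example: a task with action range [-1, 1] (e.g., continuous Mountain Car)
print(action_to_bin(-1.0, -1.0, 1.0))  # lowest bin: 0
print(action_to_bin(0.05, -1.0, 1.0))  # middle bin: 5
print(action_to_bin(1.0, -1.0, 1.0))   # highest bin: 9
```

An LLM's predicted bin index can then be compared directly against `action_to_bin(ground_truth_action, low, high)` to score the prediction.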