General agents need world models
Authors: Jonathan Richens, Tom Everitt, David Abel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our procedure for recovering a world model from an agent, and how the accuracy of the model increases as the agent learns to generalize to more tasks (longer-horizon goals). We also investigate whether our algorithm can recover the transition function when the agent strongly violates our assumptions (Def. 5). The environment used to test our algorithms is a randomly generated cMP satisfying Assumption 1, comprising 20 states and 5 actions with a sparse transition function. We train our agent using trajectories sampled from the environment under a random policy, and we increase the competency of our agent by increasing the length of the trajectory it is trained on, N_samples. See Appendix D for further details on the agent and experimental setup. We recover the world model using Algorithm 2, a simplified version of Algorithm 1. ... Nevertheless, we find that Algorithm 2 recovers the transition function with a low average error (Figure 3b), which scales as O(n^{1/2}), like the error bound in Theorem 1. |
| Researcher Affiliation | Industry | 1Google DeepMind. Correspondence to: Jonathan Richens <EMAIL>. |
| Pseudocode | Yes | First we present the pseudocode for the procedure (Algorithm 1) used in the proof of Theorem 1 to derive error-bounded estimates of the transition probabilities P_{ss'}(a) given the regret-bounded goal-conditioned policy π(a_t \| h_t; ψ). We then present Algorithm 2, an alternative algorithm for estimating P_{ss'}(a) which has weaker error bounds than Algorithm 1 but a significantly simpler implementation. |
| Open Source Code | No | The text does not contain any explicit statement about the release of source code or a link to a code repository for the methodology described in this paper. |
| Open Datasets | No | The environment used to test our algorithms is a randomly generated c MP satisfying Assumption 1, comprising of 20 states and 5 actions with a sparse transition function. We train our agent using trajectories sampled from the environment under a random policy, and we increase the competency of our agent by increasing the length of the trajectory it is trained on, Nsamples. See Appendix D for further details on the agent and experimental setup. |
| Dataset Splits | No | The agent is model-based, with the model learned from experience generated by sampling state-action trajectories from the environment under the maximally random policy for a given number of time steps N_samples ∈ {500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000}. We train 10 agents for each sample size N_samples, with a different random seed for the experience trajectories, and take the average of the experimental results over the set of agents with the same sample size. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper describes the algorithms and experimental setup but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Environment. Our environment is a cMP (Def. 1) comprising 20 states and 5 actions, and satisfying Assumption 1. It has a randomly generated transition function with a sparsity constraint such that each state-action pair has at most 5 outcomes that occur with non-zero probability, so as to ensure that navigating eventually to a given goal state is non-trivial (e.g. is not achieved by all deterministic policies). Agent. The agent is model-based, with the model learned from experience generated by sampling state-action trajectories from the environment under the maximally random policy for a given number of time steps N_samples ∈ {500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000}. Experimental setup. We train 10 agents for each sample size N_samples, with a different random seed for the experience trajectories, and take the average of the experimental results over the set of agents with the same sample size. For each agent we run Algorithm 2 for different max goal depths N ∈ {10, 20, 50, 75, 100, 200, 300, 400, 500, 600}... |
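Since no source code is released, the environment and agent setup quoted above can only be sketched. The snippet below is an illustrative reconstruction, not the authors' implementation: it generates a sparse 20-state, 5-action cMP (at most 5 non-zero-probability successors per state-action pair), samples trajectories under the maximally random policy for increasing N_samples, fits the model-based agent's transition model by empirical counts, and averages the estimation error over 10 seeds. Function names such as `make_sparse_cmp` and `empirical_model` are hypothetical, and this sketches the agent's learned model rather than Algorithm 2 itself (which recovers the model from the agent's goal-conditioned policy).

```python
import numpy as np

def make_sparse_cmp(n_states=20, n_actions=5, max_outcomes=5, rng=None):
    """Random cMP transition tensor P[a, s, s'] with at most
    `max_outcomes` non-zero-probability successors per (s, a)."""
    rng = np.random.default_rng(rng)
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            succ = rng.choice(n_states, size=max_outcomes, replace=False)
            P[a, s, succ] = rng.dirichlet(np.ones(max_outcomes))
    return P

def sample_trajectory(P, n_steps, rng=None):
    """Sample (s, a, s') transitions under the uniform (maximally random) policy."""
    rng = np.random.default_rng(rng)
    n_actions, n_states, _ = P.shape
    s = rng.integers(n_states)
    traj = []
    for _ in range(n_steps):
        a = rng.integers(n_actions)
        s2 = rng.choice(n_states, p=P[a, s])
        traj.append((s, a, s2))
        s = s2
    return traj

def empirical_model(traj, n_states, n_actions):
    """Maximum-likelihood transition estimate from visit counts.
    Unvisited (s, a) pairs are left as all-zero rows."""
    counts = np.zeros((n_actions, n_states, n_states))
    for s, a, s2 in traj:
        counts[a, s, s2] += 1
    totals = counts.sum(axis=2, keepdims=True)
    return counts / np.maximum(totals, 1)

# Average model error over 10 seeds for a few trajectory lengths N_samples.
P = make_sparse_cmp(rng=0)
errors = []
for n in (500, 2000, 10_000):
    per_seed = [np.abs(empirical_model(sample_trajectory(P, n, rng=seed), 20, 5) - P).mean()
                for seed in range(10)]
    errors.append(np.mean(per_seed))
```

With this setup, the mean absolute error of the learned model shrinks as N_samples grows, mirroring how the paper increases agent competency by training on longer experience trajectories.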