General agents need world models
Authors: Jonathan Richens, Tom Everitt, David Abel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our procedure for recovering a world model from an agent, and how the accuracy of the model increases as the agent learns to generalize to more tasks (longer-horizon goals). We also investigate whether our algorithm can recover the transition function when the agent strongly violates our assumptions (Def. 5). The environment used to test our algorithms is a randomly generated cMP satisfying Assumption 1, comprising 20 states and 5 actions with a sparse transition function. We train our agent using trajectories sampled from the environment under a random policy, and we increase the competency of our agent by increasing the length of the trajectory it is trained on, N_samples. See Appendix D for further details on the agent and experimental setup. We recover the world model using Algorithm 2, a simplified version of Algorithm 1. ... Nevertheless, we find that Algorithm 2 recovers the transition function with a low average error (Figure 3b), which scales as O(n^{1/2}), like the error bound in Theorem 1. |
| Researcher Affiliation | Industry | 1Google DeepMind. Correspondence to: Jonathan Richens <EMAIL>. |
| Pseudocode | Yes | First we present the pseudocode for the procedure (Algorithm 1) used in the proof of Theorem 1 to derive error-bounded estimates of the transition probabilities P_{ss'}(a) given the regret-bounded goal-conditioned policy π(a_t \| h_t; ψ). We then present Algorithm 2, an alternative algorithm for estimating P_{ss'}(a) which has weaker error bounds than Algorithm 1 but a significantly simpler implementation. |
| Open Source Code | No | The text does not contain any explicit statement about the release of source code or a link to a code repository for the methodology described in this paper. |
| Open Datasets | No | The environment used to test our algorithms is a randomly generated c MP satisfying Assumption 1, comprising of 20 states and 5 actions with a sparse transition function. We train our agent using trajectories sampled from the environment under a random policy, and we increase the competency of our agent by increasing the length of the trajectory it is trained on, Nsamples. See Appendix D for further details on the agent and experimental setup. |
| Dataset Splits | No | The agent is model-based, with the model learned from experience generated by sampling state-action trajectories from the environment under the maximally random policy for a given number of time steps N_samples ∈ {500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000}. We train 10 agents for each sample size N_samples, with a different random seed for the experience trajectories, and take the average of the experimental results over the set of agents with the same sample size. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper describes the algorithms and experimental setup but does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Environment. Our environment is a cMP (Def. 1) comprising 20 states and 5 actions, and satisfying Assumption 1. It has a randomly generated transition function with a sparsity constraint such that each state-action pair has at most 5 outcomes that occur with non-zero probability, so as to ensure that navigating eventually to a given goal state is non-trivial (e.g. is not achieved by all deterministic policies). Agent. The agent is model-based, with the model learned from experience generated by sampling state-action trajectories from the environment under the maximally random policy for a given number of time steps N_samples ∈ {500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000}. Experimental setup. We train 10 agents for each sample size N_samples, with a different random seed for the experience trajectories, and take the average of the experimental results over the set of agents with the same sample size. For each agent we run Algorithm 2 for different max goal depths N ∈ {10, 20, 50, 75, 100, 200, 300, 400, 500, 600}... |
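Since no source code is released, the environment and agent setup quoted above can only be sketched. The snippet below is an illustrative reconstruction, not the authors' implementation: it generates a sparse 20-state, 5-action cMP (at most 5 non-zero-probability successors per state-action pair), samples trajectories under the maximally random policy for increasing N_samples, fits the model-based agent's transition model by empirical counts, and averages the estimation error over 10 seeds. Function names such as `make_sparse_cmp` and `empirical_model` are hypothetical, and this sketches the agent's learned model rather than Algorithm 2 itself (which recovers the model from the agent's goal-conditioned policy).

```python
import numpy as np

def make_sparse_cmp(n_states=20, n_actions=5, max_outcomes=5, rng=None):
    """Random cMP transition tensor P[a, s, s'] with at most
    `max_outcomes` non-zero-probability successors per (s, a)."""
    rng = np.random.default_rng(rng)
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            succ = rng.choice(n_states, size=max_outcomes, replace=False)
            P[a, s, succ] = rng.dirichlet(np.ones(max_outcomes))
    return P

def sample_trajectory(P, n_steps, rng=None):
    """Sample (s, a, s') transitions under the uniform (maximally random) policy."""
    rng = np.random.default_rng(rng)
    n_actions, n_states, _ = P.shape
    s = rng.integers(n_states)
    traj = []
    for _ in range(n_steps):
        a = rng.integers(n_actions)
        s2 = rng.choice(n_states, p=P[a, s])
        traj.append((s, a, s2))
        s = s2
    return traj

def empirical_model(traj, n_states, n_actions):
    """Maximum-likelihood transition estimate from visit counts.
    Unvisited (s, a) pairs are left as all-zero rows."""
    counts = np.zeros((n_actions, n_states, n_states))
    for s, a, s2 in traj:
        counts[a, s, s2] += 1
    totals = counts.sum(axis=2, keepdims=True)
    return counts / np.maximum(totals, 1)

# Average model error over 10 seeds for a few trajectory lengths N_samples.
P = make_sparse_cmp(rng=0)
errors = []
for n in (500, 2000, 10_000):
    per_seed = [np.abs(empirical_model(sample_trajectory(P, n, rng=seed), 20, 5) - P).mean()
                for seed in range(10)]
    errors.append(np.mean(per_seed))
```

With this setup, the mean absolute error of the learned model shrinks as N_samples grows, mirroring how the paper increases agent competency by training on longer experience trajectories.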