Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
World Models via Policy-Guided Trajectory Diffusion
Authors: Marc Rigter, Jun Yamada, Ingmar Posner
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for short trajectories, with the exception of autoregressive diffusion. For short trajectories, PolyGRAD obtains similar errors to autoregressive diffusion, but with lower computational requirements. For long trajectories, PolyGRAD obtains comparable performance to baselines. Our experiments demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. |
| Researcher Affiliation | Academia | The provided text only lists the authors' names and the publication venue (Transactions on Machine Learning Research). No explicit institutional affiliations (university names, company names) or email domains are present in the text to classify affiliations. |
| Pseudocode | Yes | Algorithm 1: Denoising Model Training; Algorithm 2: Policy-Guided Trajectory Diffusion (PolyGRAD); Algorithm 3: Imagined RL in PolyGRAD World Model |
| Open Source Code | Yes | The code for our experiments is available at github.com/marc-rigter/polygrad-world-models. |
| Open Datasets | Yes | To answer these questions, we run experiments using the MuJoCo environments in OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | To collect the datasets, we ran Algorithm 3 until we had collected 1M transitions in each environment. Each world model was then trained on the same 1M transitions collected. The final policy produced by Algorithm 3 was used as the policy for sampling actions in each world model. While the paper describes data collection and total size, it does not specify explicit training, validation, or test splits for the collected data used to train the models. |
| Hardware Specification | Yes | We continued to train each world model until it obtained the best prediction error evaluation at a 5 step horizon, up to a maximum of either 1M gradient steps or 72 hours of training on an RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions several software components like PyTorch, Stable Baselines 3, Dreamer-v3 implementation, and rliable framework, but it does not provide specific version numbers for any of them. For example, 'We use the implementation of Dreamer-v3 (Hafner et al., 2023) available at github.com/NM512/dreamerv3-torch.' |
| Experiment Setup | Yes | Table 1: A2C Hyperparameters. Number of imagined trajectories per update: 1024; Imagined trajectory length h: 10; Generalised advantage estimation λ: 0.9; Critic learning rate: 3e-4; Optimiser: Adam; Discount factor γ: 0.99; Target policy update (target log π): 0.01; Entropy bonus weight: 1e-5; Minimum policy std. dev.: 0.1; Training steps per environment step: 0.25 |
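The A2C hyperparameters reported in Table 1 can be collected into a plain configuration mapping, as in the sketch below. The values are taken directly from the paper; the key names are illustrative and do not come from the authors' code at github.com/marc-rigter/polygrad-world-models.

```python
# Hedged sketch: A2C hyperparameters from Table 1 of the PolyGRAD paper,
# gathered into a config dict for reproduction. Key names are assumptions.
a2c_config = {
    "imagined_trajectories_per_update": 1024,  # number of imagined trajectories per update
    "imagined_trajectory_length_h": 10,        # imagined trajectory length, h
    "gae_lambda": 0.9,                         # generalised advantage estimation λ
    "critic_learning_rate": 3e-4,
    "optimiser": "Adam",
    "discount_factor_gamma": 0.99,
    "target_policy_update": 0.01,              # target log π update rate
    "entropy_bonus_weight": 1e-5,
    "min_policy_std": 0.1,                     # minimum policy std. dev.
    "training_steps_per_env_step": 0.25,
}
```

A mapping like this makes it straightforward to log the full experimental setup alongside results when attempting a reproduction.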