Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
World Models via Policy-Guided Trajectory Diffusion
Authors: Marc Rigter, Jun Yamada, Ingmar Posner
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for short trajectories, with the exception of autoregressive diffusion. For short trajectories, PolyGRAD obtains similar errors to autoregressive diffusion, but with lower computational requirements. For long trajectories, PolyGRAD obtains comparable performance to baselines. Our experiments demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. |
| Researcher Affiliation | Academia | The provided text only lists the authors' names and the publication venue (Transactions on Machine Learning Research). No explicit institutional affiliations (university names, company names) or email domains are present in the text to classify affiliations. |
| Pseudocode | Yes | Algorithm 1: Denoising Model Training; Algorithm 2: Policy-Guided Trajectory Diffusion (PolyGRAD); Algorithm 3: Imagined RL in PolyGRAD World Model |
| Open Source Code | Yes | The code for our experiments is available at github.com/marc-rigter/polygrad-world-models. |
| Open Datasets | Yes | To answer these questions, we run experiments using the MuJoCo environments in OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | To collect the datasets, we ran Algorithm 3 until we had collected 1M transitions in each environment. Each world model was then trained on the same 1M transitions collected. The final policy produced by Algorithm 3 was used as the policy for sampling actions in each world model. While the paper describes data collection and total size, it does not specify explicit training, validation, or test splits for the collected data used to train the models. |
| Hardware Specification | Yes | We continued to train each world model until it obtained the best prediction error evaluation at a 5 step horizon, up to a maximum of either 1M gradient steps or 72 hours of training on an RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions several software components like PyTorch, Stable Baselines 3, Dreamer-v3 implementation, and rliable framework, but it does not provide specific version numbers for any of them. For example, 'We use the implementation of Dreamer-v3 (Hafner et al., 2023) available at github.com/NM512/dreamerv3-torch.' |
| Experiment Setup | Yes | Table 1: A2C Hyperparameters. Number of imagined trajectories per update: 1024; Imagined trajectory length h: 10; Generalised advantage estimation λ: 0.9; Critic learning rate: 3e-4; Optimiser: Adam; Discount factor γ: 0.99; Target policy update (target log π): 0.01; Entropy bonus weight: 1e-5; Minimum policy std. dev.: 0.1; Training steps per environment step: 0.25 |
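The A2C hyperparameters reported in Table 1 can be collected into a plain configuration mapping, as in the sketch below. The values are taken directly from the paper; the key names are illustrative and do not come from the authors' code at github.com/marc-rigter/polygrad-world-models.

```python
# Hedged sketch: A2C hyperparameters from Table 1 of the PolyGRAD paper,
# gathered into a config dict for reproduction. Key names are assumptions.
a2c_config = {
    "imagined_trajectories_per_update": 1024,  # number of imagined trajectories per update
    "imagined_trajectory_length_h": 10,        # imagined trajectory length, h
    "gae_lambda": 0.9,                         # generalised advantage estimation λ
    "critic_learning_rate": 3e-4,
    "optimiser": "Adam",
    "discount_factor_gamma": 0.99,
    "target_policy_update": 0.01,              # target log π update rate
    "entropy_bonus_weight": 1e-5,
    "min_policy_std": 0.1,                     # minimum policy std. dev.
    "training_steps_per_env_step": 0.25,
}
```

A mapping like this makes it straightforward to log the full experimental setup alongside results when attempting a reproduction.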