Accurate and Efficient World Modeling with Masked Latent Transformers
Authors: Maxime Burchi, Radu Timofte
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts' performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation. [...] In this section, we describe our experiments on the Crafter benchmark. We show the results obtained by EMERALD in Table 2. We also perform an ablation study on the world model architecture in section 4.3. Finally, we analyze the impact of the number of decoding steps on world model predictions in section 4.4. |
| Researcher Affiliation | Academia | 1Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany. Correspondence to: Maxime Burchi <EMAIL>. |
| Pseudocode | No | The paper includes figures illustrating architectures (e.g., Figure 3: Efficient masked latent Transformer-based world model) and describing components, but no explicit 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps are provided. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | On the Crafter benchmark, EMERALD achieves new state-of-the-art performance... Crafter benchmark was proposed in Hafner (2022)... Additionally, we report results on the commonly used Atari 100k benchmark (Kaiser et al., 2020) to demonstrate the general efficacy of EMERALD on Atari games that do not necessarily require the use of spatial latents to achieve near perfect reconstruction. |
| Dataset Splits | Yes | On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts' performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation. [...] We show achievements success rates over 256 evaluation episodes after training 10M environment steps. [...] We evaluate our method on the benchmark to assess EMERALD's performance on environments that do not necessarily require the use of spatial latents to achieve near perfect reconstruction. We also demonstrate improved training efficiency compared to Δ-IRIS and DIAMOND. Following preceding works, we use human-normalized metrics and compare the mean and median returns across all 26 games. The human-normalized scores are computed for each game using the scores achieved by a human player and the scores obtained by a random policy: normed score = (agent score - random score) / (human score - random score). |
| Hardware Specification | Yes | Analogously to Micheli et al. (2024), we also compare the number of collected Frames Per Second (FPS) using a single RTX 3090 GPU for training. |
| Software Dependencies | No | The paper mentions optimizers like 'Adam' and implicitly uses deep learning frameworks, but it does not specify any software libraries or frameworks with their version numbers required for reproduction. |
| Experiment Setup | Yes | Table 10: EMERALD hyper-parameters. Image resolution 64x64, Batch size (B) 16, Sequence length (T) 64, Optimizer Adam, Environment parallel instances 16, Collected frames per training step 16, Replay buffer capacity 1M, Latent space size (H W G) 4x4x32, Number categories per group 32, Temporal Transformer blocks 4, Temporal Transformer width 512, Spatial MaskGIT blocks 2, Spatial MaskGIT width 256, Number of attention heads 8, Dropout probability 0.1, Attention context length 64, Learning rate 1e-4, Gradient clipping 1000, Imagination horizon (H) 15, Number of decoding steps (S) 3, Return discount 0.997, Return lambda 0.95, Critic EMA decay 0.98, Return normalization momentum 0.99, Actor entropy scale 3e-4, Learning rate 3e-5, Gradient clipping 100. |
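The human-normalized metric quoted in the Dataset Splits row is the standard Atari 100k normalization. A minimal sketch of that computation (the numeric values in the example are made up for illustration, not taken from the paper):

```python
def human_normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """normed score = (agent score - random score) / (human score - random score)."""
    return (agent_score - random_score) / (human_score - random_score)

# Hypothetical example: an agent scoring halfway between a random policy
# and a human player gets a normalized score of 0.5.
print(human_normalized_score(600.0, 100.0, 1100.0))  # → 0.5
```

Mean and median of this quantity across the 26 games then give the aggregate metrics the paper reports.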
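For convenience, the flattened Table 10 quote above can be transcribed as a config dict. This is a hypothetical sketch: the key names are ours, not the authors', and the split of the two learning-rate/gradient-clipping pairs between world model and actor-critic is our assumption based on their ordering in the quote.

```python
# Hypothetical transcription of the EMERALD hyper-parameters quoted above.
# Key names and the world-model vs. actor-critic attribution are assumptions.
EMERALD_HPARAMS = {
    "image_resolution": (64, 64),
    "batch_size": 16,                  # B
    "sequence_length": 64,             # T
    "optimizer": "Adam",
    "env_parallel_instances": 16,
    "frames_per_train_step": 16,
    "replay_buffer_capacity": 1_000_000,
    "latent_space_size": (4, 4, 32),   # H x W x G
    "categories_per_group": 32,
    "temporal_transformer_blocks": 4,
    "temporal_transformer_width": 512,
    "spatial_maskgit_blocks": 2,
    "spatial_maskgit_width": 256,
    "attention_heads": 8,
    "dropout": 0.1,
    "attention_context_length": 64,
    "world_model_lr": 1e-4,            # first "Learning rate" entry (assumed)
    "world_model_grad_clip": 1000,     # first "Gradient clipping" entry (assumed)
    "imagination_horizon": 15,         # H
    "decoding_steps": 3,               # S
    "return_discount": 0.997,
    "return_lambda": 0.95,
    "critic_ema_decay": 0.98,
    "return_norm_momentum": 0.99,
    "actor_entropy_scale": 3e-4,
    "actor_critic_lr": 3e-5,           # second "Learning rate" entry (assumed)
    "actor_critic_grad_clip": 100,     # second "Gradient clipping" entry (assumed)
}
```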