Accurate and Efficient World Modeling with Masked Latent Transformers
Authors: Maxime Burchi, Radu Timofte
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts' performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation. [...] In this section, we describe our experiments on the Crafter benchmark. We show the results obtained by EMERALD in Table 2. We also perform an ablation study on the world model architecture in section 4.3. Finally, we analyze the impact of the number of decoding steps on world model predictions in section 4.4. |
| Researcher Affiliation | Academia | 1Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany. Correspondence to: Maxime Burchi <EMAIL>. |
| Pseudocode | No | The paper includes figures illustrating architectures (e.g., Figure 3: Efficient masked latent Transformer-based world model) and describing components, but no explicit 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps are provided. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | On the Crafter benchmark, EMERALD achieves new state-of-the-art performance... Crafter benchmark was proposed in Hafner (2022)... Additionally, we report results on the commonly used Atari 100k benchmark (Kaiser et al., 2020) to demonstrate the general efficacy of EMERALD on Atari games that do not necessarily require the use of spatial latents to achieve near perfect reconstruction. |
| Dataset Splits | Yes | On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts' performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation. [...] We show achievements success rates over 256 evaluation episodes after training 10M environment steps. [...] We evaluate our method on the benchmark to assess EMERALD's performance on environments that do not necessarily require the use of spatial latents to achieve near perfect reconstruction. We also demonstrate improved training efficiency compared to Δ-IRIS and DIAMOND. Following preceding works, we use human-normalized metrics and compare the mean and median returns across all 26 games. The human-normalized scores are computed for each game using the scores achieved by a human player and the scores obtained by a random policy: normed score = (agent score - random score) / (human score - random score). |
| Hardware Specification | Yes | Analogously to Micheli et al. (2024), we also compare the number of collected Frames Per Second (FPS) using a single RTX 3090 GPU for training. |
| Software Dependencies | No | The paper mentions optimizers like 'Adam' and implicitly uses deep learning frameworks, but it does not specify any software libraries or frameworks with their version numbers required for reproduction. |
| Experiment Setup | Yes | Table 10: EMERALD hyper-parameters. Image resolution 64x64, Batch size (B) 16, Sequence length (T) 64, Optimizer Adam, Environment parallel instances 16, Collected frames per training step 16, Replay buffer capacity 1M, Latent space size (H W G) 4x4x32, Number categories per group 32, Temporal Transformer blocks 4, Temporal Transformer width 512, Spatial MaskGIT blocks 2, Spatial MaskGIT width 256, Number of attention heads 8, Dropout probability 0.1, Attention context length 64, Learning rate 1e-4, Gradient clipping 1000, Imagination horizon (H) 15, Number of decoding steps (S) 3, Return discount 0.997, Return lambda 0.95, Critic EMA decay 0.98, Return normalization momentum 0.99, Actor entropy scale 3e-4, Learning rate 3e-5, Gradient clipping 100. |
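The human-normalized metric quoted in the Dataset Splits row is the standard Atari 100k normalization. A minimal sketch of that computation (the numeric values in the example are made up for illustration, not taken from the paper):

```python
def human_normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """normed score = (agent score - random score) / (human score - random score)."""
    return (agent_score - random_score) / (human_score - random_score)

# Hypothetical example: an agent scoring halfway between a random policy
# and a human player gets a normalized score of 0.5.
print(human_normalized_score(600.0, 100.0, 1100.0))  # → 0.5
```

Mean and median of this quantity across the 26 games then give the aggregate metrics the paper reports.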
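For convenience, the flattened Table 10 quote above can be transcribed as a config dict. This is a hypothetical sketch: the key names are ours, not the authors', and the split of the two learning-rate/gradient-clipping pairs between world model and actor-critic is our assumption based on their ordering in the quote.

```python
# Hypothetical transcription of the EMERALD hyper-parameters quoted above.
# Key names and the world-model vs. actor-critic attribution are assumptions.
EMERALD_HPARAMS = {
    "image_resolution": (64, 64),
    "batch_size": 16,                  # B
    "sequence_length": 64,             # T
    "optimizer": "Adam",
    "env_parallel_instances": 16,
    "frames_per_train_step": 16,
    "replay_buffer_capacity": 1_000_000,
    "latent_space_size": (4, 4, 32),   # H x W x G
    "categories_per_group": 32,
    "temporal_transformer_blocks": 4,
    "temporal_transformer_width": 512,
    "spatial_maskgit_blocks": 2,
    "spatial_maskgit_width": 256,
    "attention_heads": 8,
    "dropout": 0.1,
    "attention_context_length": 64,
    "world_model_lr": 1e-4,            # first "Learning rate" entry (assumed)
    "world_model_grad_clip": 1000,     # first "Gradient clipping" entry (assumed)
    "imagination_horizon": 15,         # H
    "decoding_steps": 3,               # S
    "return_discount": 0.997,
    "return_lambda": 0.95,
    "critic_ema_decay": 0.98,
    "return_norm_momentum": 0.99,
    "actor_entropy_scale": 3e-4,
    "actor_critic_lr": 3e-5,           # second "Learning rate" entry (assumed)
    "actor_critic_grad_clip": 100,     # second "Gradient clipping" entry (assumed)
}
```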