Simple, Good, Fast: Self-Supervised World Models Free of Baggage
Authors: Jan Robine, Marc Höftmann, Stefan Harmeling
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF's connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark. |
| Researcher Affiliation | Academia | Jan Robine¹˒², Marc Höftmann¹˒² & Stefan Harmeling¹˒²; ¹TU Dortmund, ²Lamarr Institute for Machine Learning and Artificial Intelligence |
| Pseudocode | Yes | The pseudocode outlining our world model and policy training procedure is presented in Algorithm 1. |
| Open Source Code | Yes | The code is available at https://github.com/jrobine/sgf. |
| Open Datasets | Yes | We evaluate our world model on the Atari 100k benchmark, which was first proposed by Kaiser et al. (2020) and has been used to evaluate many sample-efficient reinforcement learning methods (Laskin et al., 2020b; Yarats et al., 2021; Schwarzer et al., 2021a; 2023; Micheli et al., 2023; Hafner et al., 2023). |
| Dataset Splits | Yes | We evaluate our world model on the Atari 100k benchmark, which was first proposed by Kaiser et al. (2020) and has been used to evaluate many sample-efficient reinforcement learning methods (Laskin et al., 2020b; Yarats et al., 2021; Schwarzer et al., 2021a; 2023; Micheli et al., 2023; Hafner et al., 2023). It includes a subset of 26 Atari games from the Arcade Learning Environment (Bellemare et al., 2013) and is limited to 400k environment steps, which amounts to 100k steps after frame skipping or 2 hours of human gameplay. Note that all games are deterministic (Machado et al., 2018). We perform 10 runs per game and for each run we compute the average score over 100 episodes at the end of training. |
| Hardware Specification | Yes | Training SGF takes 1.5 hours on a single NVIDIA A100 GPU. Obtaining precise training times for other methods is challenging, as they depend on the GPU. Following Hafner et al. (2023), we approximate runtimes for an NVIDIA V100 GPU, assuming NVIDIA P100 GPUs are twice as slow and NVIDIA A100 GPUs are twice as fast. |
| Software Dependencies | No | The paper mentions software components like SiLU nonlinearities, layer normalization, and the AdamW optimizer, but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python). |
| Experiment Setup | Yes | Appendix F (Implementation Details) provides extensive details on stacking, preprocessing, distributions (normal, discrete regression, Bernoulli), and network architectures, including convolutional layer kernel size, stride, and padding, linear layer dimensions (d=512, D=2048), MLP hidden layer dimensions (2048, 1024), and the optimizer (AdamW). Table 7, titled "Summary of all hyperparameters," explicitly lists values for: dimensionality of y (d=512), dimensionality of z (D=2048), consistency coefficient (η=12.5), covariance coefficient (ρ=1.0), variance coefficient (ν=25.0), frame resolution (64x64), frame and action stacking (m=4), discount factor (γ=0.997), λ-return parameter (λ=0.95), entropy coefficient (1e-3), target network decay (0.98), world model training interval (every 2nd environment step), policy training interval (every 2nd environment step), environment steps (100,000), initial random steps (5,000), world model batch size (1024), world model learning rate (6e-4), world model warmup steps (5,000), world model weight decay (1e-3), world model gradient clipping (10.0), imagination batch size (3072), imagination horizon (H=10), actor-critic learning rate (2.4e-4), actor-critic gradient clipping (100.0), policy temperature for evaluation (0.5), and random actions during collection (1%). |
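To make the reported hyperparameters easier to scan and reuse, here is a minimal configuration sketch collecting the values quoted from Table 7 of the paper. The class and field names (`SGFConfig`, `y_dim`, `wm_lr`, etc.) are illustrative assumptions, not identifiers from the authors' released code; only the numeric values come from the paper.

```python
# Hypothetical config sketch of the SGF hyperparameters reported in Table 7.
# Field names are illustrative; the values are those quoted in the review above.
from dataclasses import dataclass


@dataclass(frozen=True)
class SGFConfig:
    # Self-supervised representation learning
    y_dim: int = 512                  # d, dimensionality of y
    z_dim: int = 2048                 # D, dimensionality of z
    consistency_coef: float = 12.5    # eta
    covariance_coef: float = 1.0      # rho
    variance_coef: float = 25.0       # nu
    # Observations
    frame_resolution: int = 64        # 64x64 frames
    stack_size: int = 4               # m, frame and action stacking
    # Reinforcement learning
    discount: float = 0.997           # gamma
    lambda_return: float = 0.95       # lambda-return parameter
    entropy_coef: float = 1e-3
    target_decay: float = 0.98
    # Training schedule and optimization
    env_steps: int = 100_000
    initial_random_steps: int = 5_000
    wm_batch_size: int = 1024
    wm_lr: float = 6e-4
    wm_warmup_steps: int = 5_000
    wm_weight_decay: float = 1e-3
    wm_grad_clip: float = 10.0
    imagination_batch_size: int = 3072
    imagination_horizon: int = 10     # H
    ac_lr: float = 2.4e-4
    ac_grad_clip: float = 100.0
    eval_temperature: float = 0.5
    random_action_prob: float = 0.01  # 1% random actions during collection


cfg = SGFConfig()
```

Grouping the coefficients by the component they belong to (representation learning, observations, RL, optimization) mirrors how Appendix F organizes the implementation details.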