Diffusion Models Are Real-Time Game Engines
Authors: Dani Valevski, Yaniv Leviathan, Moab Arar, Shlomi Fruchter
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure LPIPS (Zhang et al., 2018) and PSNR using the teacher-forcing setup described in Section 2, where we sample an initial state and predict a single frame based on a trajectory of ground-truth past observations. When evaluated over a random holdout of 2048 trajectories taken in 5 different levels, our model achieves a PSNR of 29.43 and an LPIPS of 0.249. The PSNR value is similar to lossy JPEG compression with quality settings of 20-30 (Petric & Milinkovic, 2018). Figure 5 shows examples of model predictions and the corresponding ground truth samples. To evaluate the importance of the different components of our method, we sample trajectories from the evaluation dataset and compute LPIPS and PSNR metrics between the ground truth and the predicted frames. Human raters are only slightly better than random chance at distinguishing between short clips of the simulation and the actual game. |
| Researcher Affiliation | Collaboration | Dani Valevski, Google Research; Yaniv Leviathan, Google Research; Moab Arar, Tel Aviv University; Shlomi Fruchter, Google DeepMind |
| Pseudocode | No | The paper describes the methods in prose and with diagrams (e.g., Figure 3), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states that it uses open-source components (Stable Diffusion 1.4, ViZDoom) and provides detailed descriptions of training parameters, but does not provide any statement or link indicating that the code for GameNGen itself is open-source or publicly available. |
| Open Datasets | No | Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play. ... We record the agent's training trajectories throughout the entire training process, which includes different skill levels of play, starting with a random policy when the agent is untrained. This set of recorded trajectories is our T_agent dataset, used for training the generative model (see Section 3.2). |
| Dataset Splits | No | For training data, we use a random subset of 70M examples from the recorded trajectories played by the agent during RL training and evaluation (see Appendix A.3 for results with smaller datasets). All image frames (during training, inference, and conditioning) are at a resolution of 320x240 padded to 320x256. We use a context length of 64 (i.e. the model is provided its own last 64 predictions as well as the last 64 actions). When evaluated over a random holdout of 2048 trajectories taken in 5 different levels, our model achieves a PSNR of 29.43 and an LPIPS of 0.249. When sampled auto-regressively, the predicted and ground-truth trajectories often diverge after a few steps, mostly due to the accumulation of small amounts of different movement velocities between frames in each trajectory. For that reason, per-frame PSNR and LPIPS values gradually decrease and increase respectively, as can be seen in Figure 6. The predicted trajectory is still similar to the actual game in terms of content and image quality, but per-frame metrics are limited in their ability to capture this (see Appendix A.1 for samples of autoregressively generated trajectories). |
| Hardware Specification | Yes | On our hardware configuration (a single TPU-v5), a single denoiser step and an evaluation of the auto-encoder both take 10 ms. ... We train using 128 TPU-v5e devices with data parallelization. |
| Software Dependencies | Yes | It is trained on CPU using the Stable Baselines 3 infrastructure (Raffin et al., 2021). ... We re-purpose a pre-trained text-to-image diffusion model, Stable Diffusion v1.4 (Rombach et al., 2022) to predict the next frame in the game. |
| Experiment Setup | Yes | We use a batch size of 128 and a constant learning rate of 2e-5, with the Adafactor optimizer without weight decay (Shazeer & Stern, 2018) and gradient clipping of 1.0. The context frames condition is dropped with probability 0.1 to allow CFG during inference. We train using 128 TPU-v5e devices with data parallelization. Unless noted otherwise, all results in the paper are after 700,000 training steps. For noise augmentation (Section 3.2.1), we use a maximal noise level of 0.7, with 10 embedding buckets. We use a batch size of 2,048 for optimizing the latent decoder; other training parameters are identical to those of the denoiser. |
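The reported PSNR of 29.43 comes from a teacher-forcing holdout evaluation over predicted/ground-truth frame pairs. For reference, a minimal NumPy sketch of how per-frame PSNR is computed and averaged; `mean_psnr` and the pairing of frames are illustrative, not the paper's evaluation code:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two frames in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def mean_psnr(pairs):
    """Average per-frame PSNR over (predicted, ground-truth) pairs,
    as in a teacher-forcing holdout evaluation."""
    return float(np.mean([psnr(p, t) for p, t in pairs]))
```

LPIPS, by contrast, requires a learned perceptual network (Zhang et al., 2018) and is not reproduced here.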
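The Dataset Splits row states that 320x240 frames are padded to 320x256. A minimal sketch of such padding; the split between top and bottom rows is an assumption, since the paper only states the target resolution:

```python
import numpy as np

def pad_frame(frame, target_h=256):
    """Zero-pad a (240, 320, 3) frame to (256, 320, 3).
    Symmetric top/bottom padding is assumed here."""
    pad = target_h - frame.shape[0]
    top = pad // 2
    return np.pad(frame, ((top, pad - top), (0, 0), (0, 0)))
```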
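Two details in the Experiment Setup row lend themselves to a short sketch: dropping the context-frame condition with probability 0.1 to enable classifier-free guidance, and discretizing the noise-augmentation level (maximum 0.7) into 10 embedding buckets. Both functions below are illustrative assumptions about how this could look, not the paper's code; in particular, the null-context representation is not specified in the paper:

```python
import numpy as np

def maybe_drop_context(context, null_context, p_drop=0.1, rng=None):
    """With probability p_drop, replace the conditioning frames with a
    null context so classifier-free guidance can be used at inference."""
    rng = rng or np.random.default_rng()
    return null_context if rng.random() < p_drop else context

def noise_bucket(noise_level, max_noise=0.7, n_buckets=10):
    """Discretize a sampled noise-augmentation level into one of
    n_buckets embedding indices."""
    return min(int(noise_level / max_noise * n_buckets), n_buckets - 1)
```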