AdaWorld: Learning Adaptable World Models with Latent Actions

Authors: Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning. In this section, we first demonstrate AdaWorld's strengths in action transfer in Sec. 3.1. We then study how efficient world model adaptation enables better simulation and planning in Sec. 3.2. Lastly, we analyze the effectiveness of our designs with ablation studies in Sec. 3.3. To thoroughly understand the adaptability of our approach, we compare AdaWorld with three representative baselines.
Researcher Affiliation | Academia | Shenyuan Gao¹, Siyuan Zhou¹, Yilun Du², Jun Zhang¹, Chuang Gan³ ⁴ (¹HKUST, ²Harvard, ³UMass Amherst, ⁴MIT-IBM Watson AI Lab). Primary contact: Shenyuan Gao <EMAIL>.
Pseudocode | No | The paper describes the methodology, training procedures, and planning algorithms in narrative text and equations (e.g., Appendix B.4 for the visual planning process), but does not include any distinct pseudocode blocks or algorithm listings.
Open Source Code | No | The paper mentions "adaptable-world-model.github.io" at the beginning, which appears to be a project homepage. However, it does not explicitly state that the source code for the methodology described in the paper is provided at this link, nor does it provide a direct link to a code repository. The sentence "Visit our project page to see planning demonstrations of agents in games." in Appendix C implies the page hosts demonstrations rather than code.
Open Datasets | Yes | Our training dataset comprises four publicly accessible datasets (Goyal et al., 2017; Grauman et al., 2022; O'Neill et al., 2024; Ju et al., 2024) and videos collected automatically from 1016 environments in Gym Retro (Nichol et al., 2018) and Procgen Benchmark (Cobbe et al., 2020). To quantitatively compare with other baselines, we construct an evaluation set sourced from the unseen LIBERO (Liu et al., 2023) and Something-Something v2 (SSv2) (Goyal et al., 2017) datasets.
Dataset Splits | Yes | To quantitatively compare with other baselines, we construct an evaluation set sourced from the unseen LIBERO (Liu et al., 2023) and Something-Something v2 (SSv2) (Goyal et al., 2017) datasets. Specifically, we select and pair videos from the same tasks in LIBERO and the same labels among the top-10 most frequent labels in SSv2, resulting in 1300 pairs for evaluation (more details in Appendix D). Each environment has a validation set consisting of 300 samples, which is used to evaluate the adaptation quality in terms of PSNR (Hore & Ziou, 2010) and LPIPS (Zhang et al., 2018). To demonstrate the adaptability with restricted labels, we collect only 100 samples for each action in every discrete environment and 100 trajectories for nuScenes. For the 16 environments in Procgen (Cobbe et al., 2020), we hold out 1000 start levels for evaluation and use the remaining 9000 levels for training.
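Adaptation quality in the row above is scored with PSNR and LPIPS. LPIPS requires a pretrained network, but PSNR is a closed-form metric; a minimal sketch (not the paper's evaluation code, and assuming 8-bit frames with a data range of 255) might look like:

```python
import numpy as np

def psnr(pred, target, data_range=255.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(data_range ** 2 / mse)

# Two 8-bit frames differing by a constant offset of 10 gray levels.
a = np.zeros((64, 64), dtype=np.uint8)
b = np.full((64, 64), 10, dtype=np.uint8)
print(round(psnr(a, b), 2))  # MSE = 100, so 10·log10(255²/100) ≈ 28.13 dB
```

In practice the metric is averaged over the 300 validation samples per environment.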
Hardware Specification | Yes | The autoregressive world model is trained for 80K steps with a batch size of 64 and a learning rate of 5×10⁻⁵ on 16 NVIDIA A100 GPUs.
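The same training run reappears in the experiment-setup row with a cosine learning rate scheduler and 10K warmup steps. A minimal sketch of that schedule, assuming linear warmup to the peak rate and cosine decay to zero (the paper does not state a floor), might be:

```python
import math

def lr_at_step(step, base_lr=5e-5, warmup_steps=10_000, total_steps=80_000):
    """Linear warmup to base_lr, then cosine decay to zero over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Peak at the end of warmup; half the peak at the midpoint of the decay phase.
print(lr_at_step(10_000), lr_at_step(45_000))
```

Equivalent schedules ship with most deep learning frameworks; the hyperparameters here are the ones quoted in the report.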
Software Dependencies | No | The paper references various models and frameworks such as the Transformer architecture, VAE, β-VAE, Stable Video Diffusion (SVD), UniMatch, VQ-VAE, the AdamW optimizer, and iVideoGPT, along with their corresponding research papers. However, it does not specify explicit version numbers for any programming languages, libraries, or software environments used for implementation (e.g., Python version, PyTorch version).
Experiment Setup | Yes | The latent action autoencoder is trained for 200K steps from scratch with a batch size of 960. We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 2.5×10⁻⁵ and a weight decay of 0.01. The hyperparameter β is set to 2×10⁻⁴ to achieve a good balance between representation capacity and context disentangling ability. The autoregressive world model is trained for 80K steps with a batch size of 64 and a learning rate of 5×10⁻⁵ on 16 NVIDIA A100 GPUs. We adopt a cosine learning rate scheduler with 10K warmup steps. For both latent action autoencoder training and world model pretraining, we randomly jitter the brightness of input frames to augment generalization ability. To demonstrate the adaptability with restricted labels, we collect only 100 samples for each action in every discrete environment and 100 trajectories for nuScenes. Using the limited interaction data, we then finetune all compared world models for 800 steps with a batch size of 32 and a learning rate of 5×10⁻⁵. In practice, we use i = 2 Cross-Entropy Method iterations. For each iteration, N = 100 action sequences with a length of L = 15 are sampled, and the best K = 10 samples are selected to update the action sampling distribution. After the optimization procedure is done, the first T = 5 actions are executed in the environment. We set the search limit to 20 steps. For efficiency, we use only 3 denoising steps and disable classifier-free guidance during planning.
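The planning procedure quoted above is a standard Cross-Entropy Method loop with i = 2 iterations, N = 100 sampled sequences of length L = 15, K = 10 elites, and T = 5 executed actions. A minimal sketch (not the paper's implementation; the Gaussian sampling distribution and the toy cost function below are our assumptions, whereas the paper scores candidates by rolling them out through the world model) might look like:

```python
import numpy as np

def cem_plan(cost_fn, action_dim, horizon=15, iterations=2,
             n_samples=100, top_k=10, seed=0):
    """Cross-Entropy Method planning loop: sample, score, refit to elites."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))  # initial sampling distribution
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample N candidate action sequences of length L.
        samples = rng.normal(mean, std, size=(n_samples, horizon, action_dim))
        costs = cost_fn(samples)  # one cost per sequence; lower is better
        # Select the best K samples and refit the sampling distribution to them.
        elites = samples[np.argsort(costs)[:top_k]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # keep a little exploration noise
    return mean

# Toy cost: squared distance of every action to a target value of 0.5.
cost = lambda a: ((a - 0.5) ** 2).sum(axis=(1, 2))
plan = cem_plan(cost, action_dim=2)
actions_to_execute = plan[:5]  # only the first T = 5 actions are executed
```

After execution, planning restarts from the new observation, subject to the quoted search limit of 20 steps.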