AdaWorld: Learning Adaptable World Models with Latent Actions

Authors: Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning. In this section, we first demonstrate AdaWorld's strengths in action transfer in Sec. 3.1. We then study how efficient world model adaptation enables better simulation and planning in Sec. 3.2. Lastly, we analyze the effectiveness of our designs with ablation studies in Sec. 3.3. To thoroughly understand the adaptability of our approach, we compare AdaWorld with three representative baselines.
Researcher Affiliation | Academia | Shenyuan Gao¹, Siyuan Zhou¹, Yilun Du², Jun Zhang¹, Chuang Gan³ ⁴ (¹HKUST, ²Harvard, ³UMass Amherst, ⁴MIT-IBM Watson AI Lab). Primary contact: Shenyuan Gao <EMAIL>.
Pseudocode | No | The paper describes the methodology, training procedures, and planning algorithms in narrative text and equations (e.g., Appendix B.4 for the visual planning process), but does not include any distinct pseudocode blocks or algorithm listings.
Open Source Code | No | The paper mentions "adaptable-world-model.github.io" at the beginning, which appears to be a project homepage. However, it does not explicitly state that the source code for the methodology described in the paper is provided at this link, nor does it provide a direct link to a code repository. The sentence "Visit our project page to see planning demonstrations of agents in games." in Appendix C implies the page hosts demonstrations rather than code.
Open Datasets | Yes | Our training dataset comprises four publicly accessible datasets (Goyal et al., 2017; Grauman et al., 2022; O'Neill et al., 2024; Ju et al., 2024) and videos collected automatically from 1016 environments in Gym Retro (Nichol et al., 2018) and Procgen Benchmark (Cobbe et al., 2020). To quantitatively compare with other baselines, we construct an evaluation set sourced from the unseen LIBERO (Liu et al., 2023) and Something-Something v2 (SSv2) (Goyal et al., 2017) datasets.
Dataset Splits | Yes | To quantitatively compare with other baselines, we construct an evaluation set sourced from the unseen LIBERO (Liu et al., 2023) and Something-Something v2 (SSv2) (Goyal et al., 2017) datasets. Specifically, we select and pair videos from the same tasks in LIBERO and the same labels among the top-10 most frequent labels in SSv2, resulting in 1300 pairs for evaluation (more details in Appendix D). Each environment has a validation set consisting of 300 samples, which is used to evaluate the adaptation quality in terms of PSNR (Hore & Ziou, 2010) and LPIPS (Zhang et al., 2018). To demonstrate the adaptability with restricted labels, we collect only 100 samples for each action in every discrete environment and 100 trajectories for nuScenes. For the 16 environments in Procgen (Cobbe et al., 2020), we hold out 1000 start levels for evaluation and use the remaining 9000 levels for training.
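Adaptation quality in the row above is scored with PSNR and LPIPS. LPIPS requires a pretrained network, but PSNR is a closed-form metric; a minimal sketch (not the paper's evaluation code, and assuming 8-bit frames with a data range of 255) might look like:

```python
import numpy as np

def psnr(pred, target, data_range=255.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(data_range ** 2 / mse)

# Two 8-bit frames differing by a constant offset of 10 gray levels.
a = np.zeros((64, 64), dtype=np.uint8)
b = np.full((64, 64), 10, dtype=np.uint8)
print(round(psnr(a, b), 2))  # MSE = 100, so 10·log10(255²/100) ≈ 28.13 dB
```

In practice the metric is averaged over the 300 validation samples per environment.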
Hardware Specification | Yes | The autoregressive world model is trained for 80K steps with a batch size of 64 and a learning rate of 5×10⁻⁵ on 16 NVIDIA A100 GPUs.
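The same training run reappears in the experiment-setup row with a cosine learning rate scheduler and 10K warmup steps. A minimal sketch of that schedule, assuming linear warmup to the peak rate and cosine decay to zero (the paper does not state a floor), might be:

```python
import math

def lr_at_step(step, base_lr=5e-5, warmup_steps=10_000, total_steps=80_000):
    """Linear warmup to base_lr, then cosine decay to zero over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Peak at the end of warmup; half the peak at the midpoint of the decay phase.
print(lr_at_step(10_000), lr_at_step(45_000))
```

Equivalent schedules ship with most deep learning frameworks; the hyperparameters here are the ones quoted in the report.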
Software Dependencies | No | The paper references various models and frameworks such as the Transformer architecture, VAE, β-VAE, Stable Video Diffusion (SVD), UniMatch, VQ-VAE, the AdamW optimizer, and iVideoGPT, along with their corresponding research papers. However, it does not specify explicit version numbers for any programming languages, libraries, or software environments used for implementation (e.g., Python version, PyTorch version).
Experiment Setup | Yes | The latent action autoencoder is trained for 200K steps from scratch with a batch size of 960. We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 2.5×10⁻⁵ and a weight decay of 0.01. The hyperparameter β is set to 2×10⁻⁴ to achieve a good balance between representation capacity and context disentangling ability. The autoregressive world model is trained for 80K steps with a batch size of 64 and a learning rate of 5×10⁻⁵ on 16 NVIDIA A100 GPUs. We adopt a cosine learning rate scheduler with 10K warmup steps. For both latent action autoencoder training and world model pretraining, we randomly jitter the brightness of input frames to augment generalization ability. To demonstrate the adaptability with restricted labels, we collect only 100 samples for each action in every discrete environment and 100 trajectories for nuScenes. Using the limited interaction data, we then finetune all compared world models for 800 steps with a batch size of 32 and a learning rate of 5×10⁻⁵. In practice, we use i = 2 Cross-Entropy Method iterations. For each iteration, N = 100 action sequences with a length of L = 15 are sampled, and the best K = 10 samples are selected to update the action sampling distribution. After the optimization procedure is done, the first T = 5 actions are executed in the environment. We set the search limit to 20 steps. For efficiency, we use only 3 denoising steps and disable classifier-free guidance during planning.
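The planning procedure quoted above is a standard Cross-Entropy Method loop with i = 2 iterations, N = 100 sampled sequences of length L = 15, K = 10 elites, and T = 5 executed actions. A minimal sketch (not the paper's implementation; the Gaussian sampling distribution and the toy cost function below are our assumptions, whereas the paper scores candidates by rolling them out through the world model) might look like:

```python
import numpy as np

def cem_plan(cost_fn, action_dim, horizon=15, iterations=2,
             n_samples=100, top_k=10, seed=0):
    """Cross-Entropy Method planning loop: sample, score, refit to elites."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))  # initial sampling distribution
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample N candidate action sequences of length L.
        samples = rng.normal(mean, std, size=(n_samples, horizon, action_dim))
        costs = cost_fn(samples)  # one cost per sequence; lower is better
        # Select the best K samples and refit the sampling distribution to them.
        elites = samples[np.argsort(costs)[:top_k]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # keep a little exploration noise
    return mean

# Toy cost: squared distance of every action to a target value of 0.5.
cost = lambda a: ((a - 0.5) ** 2).sum(axis=(1, 2))
plan = cem_plan(cost, action_dim=2)
actions_to_execute = plan[:5]  # only the first T = 5 actions are executed
```

After execution, planning restarts from the new observation, subject to the quoted search limit of 20 steps.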