World Model Implanting for Test-time Adaptation of Embodied Agents
Authors: Minjong Yoo, Jinwoo Jang, Sihyung Yoon, Honguk Woo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the framework's potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential. |
| Researcher Affiliation | Academia | Department of Computer Science and Engineering, Sungkyunkwan University. Correspondence to: Honguk Woo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 WorMI Framework |
| Open Source Code | No | The paper refers to open-source links for baselines (e.g., "For implementation, we refer to the opensource 1."), but does not provide an explicit statement or link for the code of the WorMI framework itself. |
| Open Datasets | Yes | Through experiments with VirtualHome (Puig et al., 2018) and ALFWorld (Shridhar et al., 2021), we demonstrate that the WorMI framework achieves competitive performance in both effectiveness and efficiency compared to several state-of-the-art LLM-based embodied agents. |
| Dataset Splits | No | For VirtualHome, we collect 1,023 episodes, covering 78 tasks (16 seen, 62 unseen) across 20 distinct scenes (6 seen, 14 unseen), each featuring unique room layouts and objects. For ALFWorld, we collect 3,554 episodes across diverse scenes. Following CL-ALFRED benchmark settings (Kim et al., 2024), the data is clustered into 4 scene types (3 seen, 1 unseen) and 6 task types (4 seen, 2 unseen). |
| Hardware Specification | No | The paper discusses inference time and memory usage for different LLM models (e.g., LLaMA-3.2-11B, LLaMA-3.2-1B) in Table A.8, but it does not specify the actual hardware (e.g., GPU or CPU model) on which these measurements were performed or the experiments were run. |
| Software Dependencies | Yes | In our implementation, we use a fixed Llama-3.2-3B (AI@Meta, 2024) model for ZSP, LLM-Planner, the Say model in SayCanPay, and the reasoning model in WorMI. For LLM+FT, the Pay model in SayCanPay, and the world models in WorMI, we use a trainable Llama-3.2-1B model. |
| Experiment Setup | Yes | Table A.5 (hyperparameter settings and configurations of baselines): trainable model (LLM+FT, and Pay model in SayCanPay): Llama-3.2-1B; reasoning model (ZSP, LLM-Planner, and Say model in SayCanPay): Llama-3.2-3B; batch size: 4; gradient steps: 200; learning rate scheduler: cosine; initial learning rate: 5e-5; learning rate (for few-shot learning): 1e-6; temperature (both Llama-3.2-1B and Llama-3.2-3B): 1.0. Table A.6 (hyperparameter settings and configurations of WorMI): learning world models M1, ..., MN — base model: Llama-3.2-1B; batch size: 4; gradient steps: 2000; learning rate scheduler: cosine; learning rate: 3e-5; temperature: 1.0; intermediate connection layer: [13, 27]. Learning compound attention — reasoning model πR: Llama-3.2-3B; batch size: 4; meta update steps λM: 8; inner-loop gradient steps λI: 30; learning rate scheduler: cosine; learning rate α: 1e-5; meta learning rate β: 1e-1; temperature: 1.0; learning rate (for few-shot learning): 1e-5; intermediate connection layer: [13, 27] for reasoning model, [7, 15] for world models. Prototype-based world model retrieval — number of embeddings in prototype k: 15; number of world models N: 6; number of retrieved world models K: 3. |
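The hyperparameter settings reported in Tables A.5 and A.6 can be collected into plain Python dictionaries for reference; this is only an illustrative sketch — the key names below are assumptions for readability, not identifiers from the paper's (unreleased) codebase, while the values are copied from the tables.

```python
# Illustrative config sketch of Tables A.5/A.6; key names are assumptions.

# Table A.5: baseline settings (LLM+FT, and Pay model in SayCanPay).
BASELINE_CONFIG = {
    "trainable_model": "Llama-3.2-1B",   # LLM+FT, Pay model in SayCanPay
    "reasoning_model": "Llama-3.2-3B",   # ZSP, LLM-Planner, Say model
    "batch_size": 4,
    "gradient_steps": 200,
    "lr_scheduler": "cosine",
    "initial_lr": 5e-5,
    "few_shot_lr": 1e-6,
    "temperature": 1.0,
}

# Table A.6: WorMI world-model learning (models M1, ..., MN).
WORMI_WORLD_MODEL_CONFIG = {
    "base_model": "Llama-3.2-1B",
    "batch_size": 4,
    "gradient_steps": 2000,
    "lr_scheduler": "cosine",
    "lr": 3e-5,
    "temperature": 1.0,
    "connection_layers": [13, 27],       # intermediate connection layer
}

# Table A.6: WorMI compound-attention meta-learning.
WORMI_COMPOUND_ATTENTION_CONFIG = {
    "reasoning_model": "Llama-3.2-3B",   # πR
    "batch_size": 4,
    "meta_update_steps": 8,              # λM
    "inner_loop_gradient_steps": 30,     # λI
    "lr_scheduler": "cosine",
    "lr_alpha": 1e-5,                    # α
    "meta_lr_beta": 1e-1,                # β
    "temperature": 1.0,
    "few_shot_lr": 1e-5,
    "reasoning_connection_layers": [13, 27],
    "world_model_connection_layers": [7, 15],
}

# Table A.6: prototype-based world-model retrieval.
WORMI_RETRIEVAL_CONFIG = {
    "prototype_embeddings_k": 15,
    "num_world_models_N": 6,
    "num_retrieved_K": 3,
}
```

Keeping the four groups separate mirrors the structure of the original tables: baseline fine-tuning, world-model learning, compound-attention meta-learning, and retrieval are tuned independently in the reported setup.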