World Model Implanting for Test-time Adaptation of Embodied Agents
Authors: Minjong Yoo, Jinwoo Jang, Sihyung Yoon, Honguk Woo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the framework's potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential. |
| Researcher Affiliation | Academia | Department of Computer Science and Engineering, Sungkyunkwan University. Correspondence to: Honguk Woo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 WorMI Framework |
| Open Source Code | No | The paper refers to open-source links for baselines (e.g., "For implementation, we refer to the opensource 1."), but does not provide an explicit statement or link for the code of the WorMI framework itself. |
| Open Datasets | Yes | Through experiments with VirtualHome (Puig et al., 2018) and ALFWorld (Shridhar et al., 2021), we demonstrate that the WorMI framework achieves competitive performance in both effectiveness and efficiency compared to several state-of-the-art LLM-based embodied agents. |
| Dataset Splits | No | For VirtualHome, we collect 1,023 episodes, covering 78 tasks (16 seen, 62 unseen) across 20 distinct scenes (6 seen, 14 unseen), each featuring unique room layouts and objects. For ALFWorld, we collect 3,554 episodes across diverse scenes. Following CL-ALFRED benchmark settings (Kim et al., 2024), the data is clustered into 4 scene types (3 seen, 1 unseen) and 6 task types (4 seen, 2 unseen). |
| Hardware Specification | No | The paper discusses inference time and memory usage for different LLM models (e.g., LLaMA-3.2-11B, LLaMA-3.2-1B) in Table A.8, but it does not specify the actual hardware (e.g., GPU or CPU model) on which these measurements were performed or the experiments were run. |
| Software Dependencies | Yes | In our implementation, we use a fixed Llama-3.2-3B (AI@Meta, 2024) model for ZSP, LLM-Planner, the Say model in SayCanPay, and the reasoning model in WorMI. For LLM+FT, the Pay model in SayCanPay, and the world models in WorMI, we use a trainable Llama-3.2-1B model. |
| Experiment Setup | Yes | Table A.5 (hyperparameter settings and configurations of baselines): trainable model (LLM+FT, and Pay model in SayCanPay): Llama-3.2-1B; reasoning model (ZSP, LLM-Planner, and Say model in SayCanPay): Llama-3.2-3B; batch size: 4; gradient steps: 200; learning rate scheduler: cosine; initial learning rate: 5e-5; learning rate (for few-shot learning): 1e-6; temperature (both Llama-3.2-1B and Llama-3.2-3B): 1.0. Table A.6 (hyperparameter settings and configurations of WorMI): learning world models M1, ..., MN — base model: Llama-3.2-1B; batch size: 4; gradient steps: 2000; learning rate scheduler: cosine; learning rate: 3e-5; temperature: 1.0; intermediate connection layer: [13, 27]. Learning compound attention — reasoning model πR: Llama-3.2-3B; batch size: 4; meta update steps λM: 8; inner-loop gradient steps λI: 30; learning rate scheduler: cosine; learning rate α: 1e-5; meta learning rate β: 1e-1; temperature: 1.0; learning rate (for few-shot learning): 1e-5; intermediate connection layer: [13, 27] for reasoning model, [7, 15] for world models. Prototype-based world model retrieval — number of embeddings in prototype k: 15; number of world models N: 6; number of retrieved world models K: 3. |
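The hyperparameter settings reported in Tables A.5 and A.6 can be collected into plain Python dictionaries for reference; this is only an illustrative sketch — the key names below are assumptions for readability, not identifiers from the paper's (unreleased) codebase, while the values are copied from the tables.

```python
# Illustrative config sketch of Tables A.5/A.6; key names are assumptions.

# Table A.5: baseline settings (LLM+FT, and Pay model in SayCanPay).
BASELINE_CONFIG = {
    "trainable_model": "Llama-3.2-1B",   # LLM+FT, Pay model in SayCanPay
    "reasoning_model": "Llama-3.2-3B",   # ZSP, LLM-Planner, Say model
    "batch_size": 4,
    "gradient_steps": 200,
    "lr_scheduler": "cosine",
    "initial_lr": 5e-5,
    "few_shot_lr": 1e-6,
    "temperature": 1.0,
}

# Table A.6: WorMI world-model learning (models M1, ..., MN).
WORMI_WORLD_MODEL_CONFIG = {
    "base_model": "Llama-3.2-1B",
    "batch_size": 4,
    "gradient_steps": 2000,
    "lr_scheduler": "cosine",
    "lr": 3e-5,
    "temperature": 1.0,
    "connection_layers": [13, 27],       # intermediate connection layer
}

# Table A.6: WorMI compound-attention meta-learning.
WORMI_COMPOUND_ATTENTION_CONFIG = {
    "reasoning_model": "Llama-3.2-3B",   # πR
    "batch_size": 4,
    "meta_update_steps": 8,              # λM
    "inner_loop_gradient_steps": 30,     # λI
    "lr_scheduler": "cosine",
    "lr_alpha": 1e-5,                    # α
    "meta_lr_beta": 1e-1,                # β
    "temperature": 1.0,
    "few_shot_lr": 1e-5,
    "reasoning_connection_layers": [13, 27],
    "world_model_connection_layers": [7, 15],
}

# Table A.6: prototype-based world-model retrieval.
WORMI_RETRIEVAL_CONFIG = {
    "prototype_embeddings_k": 15,
    "num_world_models_N": 6,
    "num_retrieved_K": 3,
}
```

Keeping the four groups separate mirrors the structure of the original tables: baseline fine-tuning, world-model learning, compound-attention meta-learning, and retrieval are tuned independently in the reported setup.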