MrSteve: Instruction-Following Agents in Minecraft with What-Where-When Memory
Authors: Junyeong Park, Junmo Cho, Sungjin Ahn
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents a step-by-step validation of our agent MrSteve across various environments and conditions. We begin by evaluating the exploration and navigation ability of MrSteve, which is crucial in sparse sequential tasks (Section 4.1). Then, we demonstrate MrSteve's capability to solve the A-B-A task sequentially, where memory is necessary to solve task A twice (Section 4.2). Additionally, we show that the proposed Place Event Memory outperforms other memory variants, particularly when memory capacity is limited (Section 4.3). Lastly, we showcase the generalization of MrSteve to long-horizon sparse sequential tasks (Section 4.4). Each baseline and task is explained in its experiment section, with more details in Appendix C. |
| Researcher Affiliation | Academia | Junyeong Park1 , Junmo Cho1 , Sungjin Ahn1,2 1KAIST & 2New York University |
| Pseudocode | Yes | Algorithm 1 MrSteve Single Loop. Require: Memory Mt and task τn. 1: candidates ← Read(Mt, τn); 2: if candidates ≠ ∅ then 3: Xt, lt = OneOf(candidates); 4: Navigate to lt with πL-Nav; 5: Execute τn with πInst; 6: else 7: Explore with πH-Cnt, πL-Nav; 8: end if |
| Open Source Code | No | We will release our code and demos on the project page: https://sites.google.com/view/mr-steve. |
| Open Datasets | Yes | Minecraft has become a leading testbed, offering a demanding, open-ended environment with rich interaction possibilities. Its procedurally generated world presents agents with challenges like exploration, resource management, tool crafting, and survival, all requiring advanced decision-making and long-horizon planning. For instance, the task of obtaining a diamond requires agents to locate diamond ore and craft an iron pickaxe. This process involves finding, mining, and refining iron ore, requiring the agent to execute detailed long-term planning over roughly 24,000 environmental steps (Li et al., 2024). All tasks are implemented using MineDojo (Fan et al., 2022b). |
| Dataset Splits | No | The paper describes various experimental tasks and phases (e.g., 'exploration phase', 'task phase', 'ABA-Sparse task') and mentions training models like VPT-Nav, but it does not specify any dataset splits (e.g., train/test/validation percentages or counts) for any underlying dataset used in these tasks or model training. |
| Hardware Specification | Yes | Our study was performed on an Intel server equipped with 8 NVIDIA RTX 4090 GPUs and 512GB of memory. |
| Software Dependencies | No | We used PPO (Schulman et al., 2017) for fine-tuning goal encoder Gψ, LoRA parameters, policy πψ, and value vψ with reward based on the distance to the goal location. All tasks are implemented using MineDojo (Fan et al., 2022b). |
| Experiment Setup | Yes | Table 5: Hyper-parameters for the Goal-Conditioned Navigation VPT Training — Initial VPT Model: rl-from-foundation-2x; Discount Factor: 0.999; Rollout Buffer Size: 40; Training Epochs per Iteration: 5; Vectorized Environments: 4; Learning Rate: 10^-4; KL Loss Coefficient: 10^-4; KL Loss Coefficient Decay: 0.999; Total Iterations: 400K; Steps per Iteration: 500; GAE Lambda: 0.95; Clip Range: 0.2 |
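The Algorithm 1 pseudocode quoted above (query memory for the current task; if candidate event/location pairs exist, navigate there and execute, otherwise explore) can be sketched as a single loop iteration. This is a minimal illustration, not the authors' implementation: the function names (`read_memory`, `one_of`, and the `navigate`/`execute`/`explore` callbacks standing in for πL-Nav, πInst, and πH-Cnt) are hypothetical stand-ins.

```python
# Minimal sketch of Algorithm 1 (MrSteve single loop).
# All names below are hypothetical stand-ins for the paper's
# Read, OneOf, pi_L-Nav, pi_Inst, and pi_H-Cnt components.

def mr_steve_step(memory, task, policies):
    """One loop iteration: query memory for the task; if candidate
    (event, location) pairs exist, navigate and execute the task;
    otherwise fall back to exploration."""
    candidates = policies["read"](memory, task)   # Read(M_t, tau_n)
    if candidates:                                # candidates != empty set
        event, location = policies["one_of"](candidates)
        policies["navigate"](location)            # low-level navigation (pi_L-Nav)
        policies["execute"](task)                 # instruction policy (pi_Inst)
        return "executed"
    policies["explore"]()                         # exploration (pi_H-Cnt + pi_L-Nav)
    return "explored"
```

The branch on `candidates` mirrors lines 2-8 of the quoted pseudocode: memory hits short-circuit exploration, which is what makes repeated tasks (the A-B-A setting) cheap once A has been seen.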