MrSteve: Instruction-Following Agents in Minecraft with What-Where-When Memory
Authors: Junyeong Park, Junmo Cho, Sungjin Ahn
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents a step-by-step validation of our agent MrSteve across various environments and conditions. We begin by evaluating the exploration and navigation ability of MrSteve, which is crucial in sparse sequential tasks (Section 4.1). Then, we demonstrate MrSteve's capability to solve the A-B-A task sequentially, where memory is necessary to solve task A twice (Section 4.2). Additionally, we show that the proposed Place Event Memory outperforms other memory variants, particularly when memory capacity is limited (Section 4.3). Lastly, we showcase the generalization of MrSteve to long-horizon sparse sequential tasks (Section 4.4). Each baseline and task is explained in its experiment section, with more details in Appendix C. |
| Researcher Affiliation | Academia | Junyeong Park1 , Junmo Cho1 , Sungjin Ahn1,2 1KAIST & 2New York University |
| Pseudocode | Yes | Algorithm 1 MrSteve Single Loop. Require: Memory Mt and task τn. 1: candidates ← Read(Mt, τn); 2: if candidates ≠ ∅ then 3: Xt, lt = OneOf(candidates); 4: Navigate to lt with πL-Nav; 5: Execute τn with πInst; 6: else 7: Explore with πH-Cnt, πL-Nav; 8: end if |
| Open Source Code | No | We will release our code and demos on the project page: https://sites.google.com/view/mr-steve. |
| Open Datasets | Yes | Minecraft has become a leading testbed, offering a demanding, open-ended environment with rich interaction possibilities. Its procedurally generated world presents agents with challenges like exploration, resource management, tool crafting, and survival, all requiring advanced decision-making and long-horizon planning. For instance, the task of obtaining a diamond requires agents to locate diamond ore and craft an iron pickaxe. This process involves finding, mining, and refining iron ore, requiring the agent to execute detailed long-term planning over roughly 24,000 environmental steps (Li et al., 2024). All tasks are implemented using MineDojo (Fan et al., 2022b). |
| Dataset Splits | No | The paper describes various experimental tasks and phases (e.g., 'exploration phase', 'task phase', 'ABA-Sparse task') and mentions training models like VPT-Nav, but it does not specify any dataset splits (e.g., train/test/validation percentages or counts) for any underlying dataset used in these tasks or model training. |
| Hardware Specification | Yes | Our study was performed on an Intel server equipped with 8 NVIDIA RTX 4090 GPUs and 512GB of memory. |
| Software Dependencies | No | We used PPO (Schulman et al., 2017) for fine-tuning goal encoder Gψ, LoRA parameters, policy πψ, and value vψ with reward based on the distance to the goal location. All tasks are implemented using MineDojo (Fan et al., 2022b). |
| Experiment Setup | Yes | Table 5: Hyper-parameters for the Goal-Conditioned Navigation VPT Training — Initial VPT Model: rl-from-foundation-2x; Discount Factor: 0.999; Rollout Buffer Size: 40; Training Epochs per Iteration: 5; Vectorized Environments: 4; Learning Rate: 10^-4; KL Loss Coefficient: 10^-4; KL Loss Coefficient Decay: 0.999; Total Iterations: 400K; Steps per Iteration: 500; GAE Lambda: 0.95; Clip Range: 0.2 |
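The Algorithm 1 pseudocode quoted above (query memory for the current task; if candidate event/location pairs exist, navigate there and execute, otherwise explore) can be sketched as a single loop iteration. This is a minimal illustration, not the authors' implementation: the function names (`read_memory`, `one_of`, and the `navigate`/`execute`/`explore` callbacks standing in for πL-Nav, πInst, and πH-Cnt) are hypothetical stand-ins.

```python
# Minimal sketch of Algorithm 1 (MrSteve single loop).
# All names below are hypothetical stand-ins for the paper's
# Read, OneOf, pi_L-Nav, pi_Inst, and pi_H-Cnt components.

def mr_steve_step(memory, task, policies):
    """One loop iteration: query memory for the task; if candidate
    (event, location) pairs exist, navigate and execute the task;
    otherwise fall back to exploration."""
    candidates = policies["read"](memory, task)   # Read(M_t, tau_n)
    if candidates:                                # candidates != empty set
        event, location = policies["one_of"](candidates)
        policies["navigate"](location)            # low-level navigation (pi_L-Nav)
        policies["execute"](task)                 # instruction policy (pi_Inst)
        return "executed"
    policies["explore"]()                         # exploration (pi_H-Cnt + pi_L-Nav)
    return "explored"
```

The branch on `candidates` mirrors lines 2-8 of the quoted pseudocode: memory hits short-circuit exploration, which is what makes repeated tasks (the A-B-A setting) cheap once A has been seen.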