MCU: An Evaluation Framework for Open-Ended Game Agents

Authors: Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Qiang Fu, Haobo Fu, Zilong Zheng, Yitao Liang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. In this section, we first demonstrate the effectiveness of Auto Eval. Subsequently, we assess the capabilities of state-of-the-art agents using MCU and provide insights for the development of future open-ended Minecraft agents.
Researcher Affiliation | Collaboration | 1. Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China; 2. Institute for Artificial Intelligence, Peking University, Beijing, China; 3. Tencent AI Lab, Shenzhen, China.
Pseudocode | Yes | G.6. Pseudo-Code Examples: Listing 6 (Mineflayer Craft Task Pseudo-Code), Listing 7 (MCU Evaluation Process Pseudo-Code).
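To make the shape of Listing 7 concrete, here is a minimal Python sketch of an MCU-style evaluation loop: roll out the agent on each task, then have a VLM judge score the recorded trajectory per criterion. All names (`run_agent`, `vlm_grade`, the entries of `CRITERIA`) are illustrative assumptions, not the actual MCU API, and `vlm_grade` is a stub standing in for the real vision-language-model judge.

```python
from dataclasses import dataclass, field

# Illustrative scoring dimensions; see the paper for the exact criteria.
CRITERIA = ["task_progress", "action_control", "error_recognition",
            "creative_attempts", "task_efficiency", "material_usage"]

@dataclass
class TaskResult:
    task: str
    scores: dict = field(default_factory=dict)  # criterion -> score

def vlm_grade(frames, task, criterion):
    """Stub for a vision-language-model judge that scores one
    trajectory on one criterion. A real implementation would send
    the frames and a grading rubric to the VLM."""
    return 0.0

def evaluate(agent, tasks, run_agent):
    """Roll out `agent` on each task, grade the trajectory per criterion."""
    results = []
    for task in tasks:
        frames = run_agent(agent, task)  # trajectory from the Minecraft rollout
        result = TaskResult(task=task)
        for criterion in CRITERIA:
            result.scores[criterion] = vlm_grade(frames, task, criterion)
        results.append(result)
    return results
```

The per-criterion loop mirrors the multi-dimensional grading described in the paper, where a single pass/fail signal is replaced by several graded axes.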
Open Source Code | Yes | Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.
Open Datasets | Yes | Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU. The finalized atomic task list is provided as supplementary material alongside the code. Tasks were distilled from existing benchmarks such as MineDojo (Fan et al., 2022) and SkillForge (Cai et al., 2024c).
Dataset Splits | No | The paper describes evaluation datasets for human annotation and agent performance across various tasks (e.g., "500 trajectories spanning 60 tasks", "236 pairs", "227 individual ratings") and mentions training sets of varying sizes (e.g., "100 to 10,000,000 states" for the Hunt Sheep task). However, it does not specify explicit train/validation/test splits, with percentages or sample counts, for the datasets used in its experiments.
Hardware Specification | No | The paper mentions "requiring approximately 50 training hours on three GPUs" in its experimental setup, but it does not specify the model or type of GPUs used for training.
Software Dependencies | No | The paper mentions using MineStudio (Cai et al., 2024a) and implies the use of Mineflayer (PrismarineJS, 2024) through pseudocode examples (Listing 6). It also includes pseudocode with Python-like imports (Listing 7). However, it does not specify version numbers for these or any other key software dependencies.
Experiment Setup | Yes | Detailed hyperparameter configurations are listed in Table 9: Steps 25M; GAE Lambda 0.95; Learning Rate 2e-5; PPO Clip 0.1; Scheduler Linear; Policy Loss Weight 1.0; Optimizer Adam; Value Loss Weight 0.5; Adam Epsilon 1e-8; KL Loss Weight 0.3; Number of Training GPUs 2; KL Loss Decay 0.995; Batch Size per GPU 1; Reward Discount 0.999.
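For reference, the Table 9 values transcribe directly into a flat config. The sketch below is our own arrangement: the key names are assumptions, and only the numeric values come from the paper.

```python
# PPO hyperparameters as reported in Table 9 of the paper.
# Key names are illustrative; values are transcribed from the table.
PPO_CONFIG = {
    "steps": 25_000_000,        # 25M environment steps
    "learning_rate": 2e-5,
    "scheduler": "linear",
    "optimizer": "adam",
    "adam_epsilon": 1e-8,
    "ppo_clip": 0.1,
    "gae_lambda": 0.95,
    "reward_discount": 0.999,   # discount factor (gamma)
    "policy_loss_weight": 1.0,
    "value_loss_weight": 0.5,
    "kl_loss_weight": 0.3,
    "kl_loss_decay": 0.995,     # per-update multiplicative decay of the KL weight
    "num_training_gpus": 2,
    "batch_size_per_gpu": 1,
}
```

Collecting the values this way makes the missing pieces visible: the table fixes the optimizer and schedule, but without GPU models or library versions (noted as absent above) wall-clock reproduction remains approximate.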