MCU: An Evaluation Framework for Open-Ended Game Agents

Authors: Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Qiang Fu, Haobo Fu, Zilong Zheng, Yitao Liang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. In this section, we first demonstrate the effectiveness of Auto Eval. Subsequently, we assess the capabilities of state-of-the-art agents using MCU and provide insights for the development of future open-ended Minecraft agents.
Researcher Affiliation | Collaboration | 1. Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China; 2. Institute for Artificial Intelligence, Peking University, Beijing, China; 3. Tencent AI Lab, Shenzhen, China.
Pseudocode | Yes | G.6. Pseudo-Code Examples: Listing 6 (Mineflayer Craft Task Pseudo-Code), Listing 7 (MCU Evaluation Process Pseudo-Code).
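To make the shape of Listing 7 concrete, here is a minimal Python sketch of an MCU-style evaluation loop: roll out the agent on each task, then have a VLM judge score the recorded trajectory per criterion. All names (`run_agent`, `vlm_grade`, the entries of `CRITERIA`) are illustrative assumptions, not the actual MCU API, and `vlm_grade` is a stub standing in for the real vision-language-model judge.

```python
from dataclasses import dataclass, field

# Illustrative scoring dimensions; see the paper for the exact criteria.
CRITERIA = ["task_progress", "action_control", "error_recognition",
            "creative_attempts", "task_efficiency", "material_usage"]

@dataclass
class TaskResult:
    task: str
    scores: dict = field(default_factory=dict)  # criterion -> score

def vlm_grade(frames, task, criterion):
    """Stub for a vision-language-model judge that scores one
    trajectory on one criterion. A real implementation would send
    the frames and a grading rubric to the VLM."""
    return 0.0

def evaluate(agent, tasks, run_agent):
    """Roll out `agent` on each task, grade the trajectory per criterion."""
    results = []
    for task in tasks:
        frames = run_agent(agent, task)  # trajectory from the Minecraft rollout
        result = TaskResult(task=task)
        for criterion in CRITERIA:
            result.scores[criterion] = vlm_grade(frames, task, criterion)
        results.append(result)
    return results
```

The per-criterion loop mirrors the multi-dimensional grading described in the paper, where a single pass/fail signal is replaced by several graded axes.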
Open Source Code | Yes | Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.
Open Datasets | Yes | Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU. The finalized atomic task list is provided as supplementary material alongside the code. Tasks were distilled from existing benchmarks such as MineDojo (Fan et al., 2022) and SkillForge (Cai et al., 2024c).
Dataset Splits | No | The paper describes evaluation datasets for human annotation and agent performance across various tasks (e.g., "500 trajectories spanning 60 tasks", "236 pairs", "227 individual ratings") and mentions training sets of varying sizes (e.g., "100 to 10,000,000 states" for the Hunt Sheep task). However, it does not specify explicit train/validation/test splits, with percentages or sample counts, for the datasets used in its experiments.
Hardware Specification | No | The paper mentions "requiring approximately 50 training hours on three GPUs" in its experimental setup, but it does not specify the model or type of GPUs used for training.
Software Dependencies | No | The paper mentions using MineStudio (Cai et al., 2024a) and implies the use of Mineflayer (PrismarineJS, 2024) through pseudocode examples (Listing 6). It also includes pseudocode with Python-like imports (Listing 7). However, it does not specify version numbers for these or any other key software dependencies.
Experiment Setup | Yes | Detailed hyperparameter configurations are listed in Table 9: Steps 25M; GAE Lambda 0.95; Learning Rate 2e-5; PPO Clip 0.1; Scheduler Linear; Policy Loss Weight 1.0; Optimizer Adam; Value Loss Weight 0.5; Adam Epsilon 1e-8; KL Loss Weight 0.3; Number of Training GPUs 2; KL Loss Decay 0.995; Batch Size per GPU 1; Reward Discount 0.999.
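For reference, the Table 9 values transcribe directly into a flat config. The sketch below is our own arrangement: the key names are assumptions, and only the numeric values come from the paper.

```python
# PPO hyperparameters as reported in Table 9 of the paper.
# Key names are illustrative; values are transcribed from the table.
PPO_CONFIG = {
    "steps": 25_000_000,        # 25M environment steps
    "learning_rate": 2e-5,
    "scheduler": "linear",
    "optimizer": "adam",
    "adam_epsilon": 1e-8,
    "ppo_clip": 0.1,
    "gae_lambda": 0.95,
    "reward_discount": 0.999,   # discount factor (gamma)
    "policy_loss_weight": 1.0,
    "value_loss_weight": 0.5,
    "kl_loss_weight": 0.3,
    "kl_loss_decay": 0.995,     # per-update multiplicative decay of the KL weight
    "num_training_gpus": 2,
    "batch_size_per_gpu": 1,
}
```

Collecting the values this way makes the missing pieces visible: the table fixes the optimizer and schedule, but without GPU models or library versions (noted as absent above) wall-clock reproduction remains approximate.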