MCU: An Evaluation Framework for Open-Ended Game Agents
Authors: Xinyue Zheng, Haowei Lin, Kaichen He, Zihao Wang, Qiang Fu, Haobo Fu, Zilong Zheng, Yitao Liang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. In this section, we first demonstrate the effectiveness of Auto Eval. Subsequently, we assess the capabilities of state-of-the-art agents using MCU and provide insights for the development of future open-ended Minecraft agents. |
| Researcher Affiliation | Collaboration | 1Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China 2Institute for Artificial Intelligence, Peking University, Beijing, China 3Tencent AI Lab, Shenzhen, China. |
| Pseudocode | Yes | G.6. Pseudo-Code Examples Listing 6: Mineflayer Craft Task Pseudo-Code Listing 7: MCU Evaluation Process Pseudo-Code |
| Open Source Code | Yes | Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU. |
| Open Datasets | Yes | Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU. The finalized atomic task list is provided as supplementary material alongside the code. Distilling high-quality tasks from existing benchmarks, such as MineDojo (Fan et al., 2022) and SkillForge (Cai et al., 2024c). |
| Dataset Splits | No | The paper describes evaluation datasets for human annotation and agent performance across various tasks (e.g., '500 trajectories spanning 60 tasks', '236 pairs', '227 individual ratings'). It also mentions training sets of varying sizes (e.g., '100 to 10,000,000 states' for the Hunt Sheep task). However, it does not specify explicit training/test/validation dataset splits with percentages, sample counts, or a clear methodology for reproducibility of model training in a general sense for the datasets used in its experiments. |
| Hardware Specification | No | The paper mentions 'requiring approximately 50 training hours on three GPUs' when discussing experimental setup. However, it does not specify the exact model or type of GPUs used for training. |
| Software Dependencies | No | The paper mentions using MineStudio (Cai et al., 2024a) and implies the use of Mineflayer (PrismarineJS, 2024) through pseudocode examples (Listing 6). It also includes pseudocode with Python-like imports (Listing 7). However, it does not specify version numbers for these or any other key software dependencies or libraries. |
| Experiment Setup | Yes | Detailed hyperparameter configurations are listed in Table 9. Table 9 specifies: Steps 25M, GAE Lambda 0.95, Learning Rate 2 * 10^-5, PPO Clip 0.1, Scheduler Linear, Policy Loss Weight 1.0, Optimizer Adam, Value Loss Weight 0.5, Adam Epsilon 1 * 10^-8, KL Loss Weight 0.3, Number of Training GPUs 2, KL Loss Decay 0.995, Batch Size per GPU 1, Reward Discount 0.999. |
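The Table 9 hyperparameters quoted in the Experiment Setup row can be collected into a single configuration mapping. The sketch below is illustrative only: the key names and the `effective_batch_size` helper are assumptions, not a schema from the paper; the values are the ones reported above.

```python
# Hypothetical sketch of the PPO hyperparameters reported in Table 9.
# Key names and the helper function are illustrative; the paper does not
# specify a configuration schema, only the values.
ppo_config = {
    "total_steps": 25_000_000,     # Steps: 25M
    "gae_lambda": 0.95,            # GAE Lambda
    "learning_rate": 2e-5,         # Learning Rate: 2 * 10^-5
    "ppo_clip": 0.1,               # PPO Clip
    "lr_scheduler": "linear",      # Scheduler
    "policy_loss_weight": 1.0,
    "optimizer": "adam",
    "value_loss_weight": 0.5,
    "adam_epsilon": 1e-8,          # Adam Epsilon: 1 * 10^-8
    "kl_loss_weight": 0.3,
    "num_training_gpus": 2,
    "kl_loss_decay": 0.995,
    "batch_size_per_gpu": 1,
    "reward_discount": 0.999,
}


def effective_batch_size(cfg: dict) -> int:
    """Global batch size per optimizer step across all training GPUs."""
    return cfg["num_training_gpus"] * cfg["batch_size_per_gpu"]
```

With the reported values, the effective global batch size is 2 GPUs × 1 sample each, i.e. 2 samples per optimizer step.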