Odyssey: Empowering Minecraft Agents with Open-World Skills

Authors: Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions. [...] To demonstrate the effectiveness of the proposed ODYSSEY framework, we conduct experiments on basic programmatic tasks and the agent capability benchmark."
Researcher Affiliation: Academia — 1. Zhejiang University; 2. Zhejiang Provincial Engineering Research Center for Real-Time Smart Tech in Urban Security Governance, School of Computer and Computing Science, Hangzhou City University; 3. State Key Laboratory of Blockchain and Data Security, Zhejiang University; 4. Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Pseudocode: No — The paper describes a planner-actor-critic architecture and a recursive method for skill execution, with full skill and prompt details in Appendix C. However, the main text contains no clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code: Yes — "All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions."
Open Datasets: Yes — "All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions. [...] We fine-tune the LLaMA-3 model [Touvron et al., 2023] using a large-scale Question-Answering (Q&A) dataset with 390k+ instruction entries sourced from the Minecraft Wiki."
Dataset Splits: No — The paper mentions generating a 'large-scale training dataset with 390k+ instruction entries from Minecraft Wikis', evaluating with a 'custom multiple-choice dataset', and performing '120 repeated experiments' on tasks. However, it does not explicitly provide training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for these datasets.
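For context on what an explicit split specification would look like, here is a minimal sketch of a deterministic train/validation/test split over the 390k-entry Q&A dataset. The 80/10/10 proportions and the fixed seed are illustrative assumptions, not values from the paper.

```python
import random

def split_dataset(n_items, fractions=(0.8, 0.1, 0.1), seed=42):
    """Deterministically shuffle indices and split into train/val/test.

    NOTE: the 80/10/10 fractions and seed are hypothetical; the paper
    does not report any split for its 390k+ instruction entries.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # fixed seed makes the split reproducible
    n_train = int(n_items * fractions[0])
    n_val = int(n_items * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_dataset(390_000)
```

Publishing even this small amount of information (counts, proportions, seed) would let others reproduce the exact partition.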
Hardware Specification: No — The acknowledgments mention 'The advanced computing resources provided by the Supercomputing Center of Hangzhou City University', but the paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies: No — The paper mentions several tools and models, including 'Mineflayer JavaScript APIs [PrismarineJS, 2023]', 'Sentence Transformer [Reimers and Gurevych, 2019]', the 'LLaMA-3 model [Touvron et al., 2023]', 'LoRA [Hu et al., 2021]', 'GPT-3.5-Turbo [OpenAI, 2023]', and 'GPT-4 [Achiam et al., 2023]'. However, it does not specify version numbers for the software or libraries used in the implementation.
Experiment Setup: Yes — "We conduct experiments on basic programmatic tasks and the agent capability benchmark. Our simulation environment is built on top of Voyager [Wang et al., 2023a]. We conducted 120 repeated experiments on each task and recorded the average completion time for each task as well as the success rates at different time points. We compared the performance of our agent with both the fine-tuned MineMA-8B and the original LLaMA-3-8B, and also the performance of Voyager [Wang et al., 2023a] with GPT-4o-mini across these tasks."
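The two metrics quoted above (average completion time and success rate at different time points, over 120 repeated trials per task) can be sketched as a small computation. This is an illustrative reconstruction, not the authors' code; the trial data and checkpoint values are made up.

```python
import random

def summarize_trials(times, checkpoints):
    """Summarize repeated trials of one task.

    times: completion time in seconds per trial, or None if the trial failed.
    checkpoints: time points at which to report the cumulative success rate.
    Returns (average completion time over successful trials, {checkpoint: rate}).
    """
    successes = [t for t in times if t is not None]
    avg_time = sum(successes) / len(successes) if successes else None
    # Success rate at checkpoint c = fraction of all trials finished within c.
    rates = {c: sum(t is not None and t <= c for t in times) / len(times)
             for c in checkpoints}
    return avg_time, rates

# Hypothetical example: 120 simulated trials with an ~80% success rate.
random.seed(0)
trials = [random.uniform(30, 300) if random.random() < 0.8 else None
          for _ in range(120)]
avg, rates = summarize_trials(trials, checkpoints=[60, 120, 300])
```

Reporting the raw per-trial times alongside such summaries would make the paper's 120-trial results directly verifiable.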