Odyssey: Empowering Minecraft Agents with Open-World Skills

Authors: Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions. [...] To demonstrate the effectiveness of the proposed ODYSSEY framework, we conduct experiments on basic programmatic tasks and the agent capability benchmark."
Researcher Affiliation: Academia — 1. Zhejiang University; 2. Zhejiang Provincial Engineering Research Center for Real-Time Smart Tech in Urban Security Governance, School of Computer and Computing Science, Hangzhou City University; 3. State Key Laboratory of Blockchain and Data Security, Zhejiang University; 4. Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Pseudocode: No — The paper describes a planner-actor-critic architecture and a recursive method for skill execution, with full skill and prompt details in Appendix C. However, the main text contains no clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code: Yes — "All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions."
Open Datasets: Yes — "All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions. [...] We fine-tune the LLaMA-3 model [Touvron et al., 2023] using a large-scale Question-Answering (Q&A) dataset with 390k+ instruction entries sourced from the Minecraft Wiki."
Dataset Splits: No — The paper mentions generating a 'large-scale training dataset with 390k+ instruction entries from Minecraft Wikis', evaluating with a 'custom multiple-choice dataset', and performing '120 repeated experiments' on tasks. However, it does not explicitly provide training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for these datasets.
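For context on what an explicit split specification would look like, here is a minimal sketch of a deterministic train/validation/test split over the 390k-entry Q&A dataset. The 80/10/10 proportions and the fixed seed are illustrative assumptions, not values from the paper.

```python
import random

def split_dataset(n_items, fractions=(0.8, 0.1, 0.1), seed=42):
    """Deterministically shuffle indices and split into train/val/test.

    NOTE: the 80/10/10 fractions and seed are hypothetical; the paper
    does not report any split for its 390k+ instruction entries.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # fixed seed makes the split reproducible
    n_train = int(n_items * fractions[0])
    n_val = int(n_items * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_dataset(390_000)
```

Publishing even this small amount of information (counts, proportions, seed) would let others reproduce the exact partition.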
Hardware Specification: No — The acknowledgments mention 'The advanced computing resources provided by the Supercomputing Center of Hangzhou City University', but the paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies: No — The paper mentions several tools and models, including 'Mineflayer JavaScript APIs [PrismarineJS, 2023]', 'Sentence Transformer [Reimers and Gurevych, 2019]', the 'LLaMA-3 model [Touvron et al., 2023]', 'LoRA [Hu et al., 2021]', 'GPT-3.5-Turbo [OpenAI, 2023]', and 'GPT-4 [Achiam et al., 2023]'. However, it does not specify version numbers for the software or libraries used in the implementation.
Experiment Setup: Yes — "We conduct experiments on basic programmatic tasks and the agent capability benchmark. Our simulation environment is built on top of Voyager [Wang et al., 2023a]. We conducted 120 repeated experiments on each task and recorded the average completion time for each task as well as the success rates at different time points. We compared the performance of our agent with both the fine-tuned MineMA-8B and the original LLaMA-3-8B, and also the performance of Voyager [Wang et al., 2023a] with GPT-4o-mini across these tasks."
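The two metrics quoted above (average completion time and success rate at different time points, over 120 repeated trials per task) can be sketched as a small computation. This is an illustrative reconstruction, not the authors' code; the trial data and checkpoint values are made up.

```python
import random

def summarize_trials(times, checkpoints):
    """Summarize repeated trials of one task.

    times: completion time in seconds per trial, or None if the trial failed.
    checkpoints: time points at which to report the cumulative success rate.
    Returns (average completion time over successful trials, {checkpoint: rate}).
    """
    successes = [t for t in times if t is not None]
    avg_time = sum(successes) / len(successes) if successes else None
    # Success rate at checkpoint c = fraction of all trials finished within c.
    rates = {c: sum(t is not None and t <= c for t in times) / len(times)
             for c in checkpoints}
    return avg_time, rates

# Hypothetical example: 120 simulated trials with an ~80% success rate.
random.seed(0)
trials = [random.uniform(30, 300) if random.random() < 0.8 else None
          for _ in range(120)]
avg, rates = summarize_trials(trials, checkpoints=[60, 120, 300])
```

Reporting the raw per-trial times alongside such summaries would make the paper's 120-trial results directly verifiable.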