Voyager: An Open-Ended Embodied Agent with Large Language Models
Authors: Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3× more unique items, travels 2.3× longer distances, and unlocks key tech tree milestones up to 15.3× faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. The paper also includes a dedicated 'Experiments' section (Section 3) with subsections for 'Experimental Setup', 'Baselines', 'Evaluation Results', and 'Ablation Studies', all indicating empirical validation. |
| Researcher Affiliation | Collaboration | NVIDIA, Caltech, UT Austin, Stanford, UW Madison. The affiliations include both NVIDIA (an industry entity) and universities (Caltech, UT Austin, Stanford, UW Madison), indicating a collaboration between industry and academia. |
| Pseudocode | Yes | The pseudocode of the Voyager algorithm is shown in Pseudocode 1. |
| Open Source Code | No | The paper mentions 'MineDojo (Fan et al., 2022), an open-source Minecraft AI framework' and 'https://voyager.minedojo.org'. However, it does not provide an explicit statement from the authors that *their own code for Voyager* is released, nor a direct link to a code repository for their implementation. The provided URL is a project page, not a specific code repository. |
| Open Datasets | Yes | We evaluate Voyager systematically against other LLM-based agent techniques (e.g., ReAct (Yao et al., 2022), Reflexion (Shinn et al., 2023), AutoGPT (Richards, 2023)) in MineDojo (Fan et al., 2022), an open-source Minecraft AI framework. MineDojo is cited as 'MineDojo: Building open-ended embodied agents with internet-scale knowledge.' (Fan et al., 2022). Other datasets like 'MineRL: A large-scale dataset of Minecraft demonstrations.' (Guss et al., 2019b) are also cited. |
| Dataset Splits | No | The paper mentions experimental runs like 'We run three trials for each method.' and scenarios like 'To evaluate zero-shot generalization, we clear the agent's inventory, reset it to a newly instantiated world, and test it with unseen tasks.' but does not specify any conventional training/test/validation dataset splits (e.g., percentages, sample counts, or predefined splits) for reproducing data partitioning. |
| Hardware Specification | No | The paper states, 'We leverage OpenAI's gpt-4-0314 (OpenAI, 2023) and gpt-3.5-turbo-0301 (chatgpt) APIs for text completion, along with text-embedding-ada-002 (embedding) API for text embedding.' This specifies the APIs used, which are cloud-based services, but does not provide details about the specific hardware (e.g., GPU models, CPU types) on which the experiments were run or the Minecraft simulation was executed. |
| Software Dependencies | Yes | We leverage OpenAI's gpt-4-0314 (OpenAI, 2023) and gpt-3.5-turbo-0301 (chatgpt) APIs for text completion, along with text-embedding-ada-002 (embedding) API for text embedding. Our simulation environment is built on top of MineDojo (Fan et al., 2022) and utilizes Mineflayer (PrismarineJS, 2013) JavaScript APIs for motor controls. |
| Experiment Setup | Yes | We set all temperatures to 0 except for the automatic curriculum, which uses temperature = 0.1 to encourage task diversity. If the bot dies, it is resurrected near the closest ground, and its inventory is preserved for uninterrupted exploration. The bot recycles its crafting table and furnace after program execution. Appendix A.2.3 and Table A.1 also detail a 'Warm-up schedule' for incorporating information into prompts, specifying the number of tasks completed before certain information is used. |
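The skill-library reuse noted above relies on embedding-based retrieval (the paper uses the text-embedding-ada-002 API). A minimal sketch of the idea, with a toy bag-of-words embedding standing in for the real embedding API and illustrative skill names not taken from the authors' code:

```python
# Hedged sketch of embedding-based skill retrieval, as in Voyager's skill
# library. A toy bag-of-words "embedding" replaces text-embedding-ada-002;
# skill names and descriptions below are hypothetical examples.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a text-embedding API: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_skill(query: str, skills: dict) -> str:
    """Return the stored skill whose description best matches the task query."""
    q = embed(query)
    return max(skills, key=lambda name: cosine(q, embed(skills[name])))

skills = {
    "craftWoodenPickaxe": "craft a wooden pickaxe from planks and sticks",
    "mineIronOre": "mine iron ore with a stone pickaxe",
    "cookMeat": "cook raw meat in a furnace",
}
print(top_skill("obtain iron ore", skills))  # -> mineIronOre
```

In the actual system the query and skill descriptions would be embedded by the OpenAI API rather than word counts, but the retrieval step (nearest neighbor by cosine similarity) is the same shape.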
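The temperature settings reported in the Experiment Setup row can be captured in a small configuration sketch. This is an illustrative reconstruction, not the authors' code; the component names are hypothetical, while the model name and temperature values come from the paper:

```python
# Hypothetical sketch of the per-component sampling settings the paper
# reports: temperature 0 everywhere except the automatic curriculum, which
# uses 0.1 to encourage task diversity. Component names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    model: str
    temperature: float

def component_config(component: str) -> LLMConfig:
    """Sampling settings for a named Voyager component (illustrative)."""
    # gpt-4-0314 for text completion (per the paper); only the curriculum
    # samples with nonzero temperature.
    temperature = 0.1 if component == "automatic_curriculum" else 0.0
    return LLMConfig(model="gpt-4-0314", temperature=temperature)

print(component_config("automatic_curriculum"))  # temperature=0.1
print(component_config("skill_library"))         # temperature=0.0
```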