ADAM: An Embodied Causal Agent in Open-World Environments

Authors: Shu Yu, Chaochao Lu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that ADAM constructs a nearly perfect causal graph from scratch, enabling efficient task decomposition and execution with strong interpretability. Notably, in the modified Minecraft game where no prior knowledge is available, ADAM excels with remarkable robustness and generalization capability.
Researcher Affiliation | Collaboration | 1 Shanghai Artificial Intelligence Laboratory, 2 Shanghai Innovation Institute, 3 Fudan University
Pseudocode | No | The paper describes the architecture and components of ADAM (Interaction module, Causal model module, Controller module, Perception module) in detail, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code or an algorithm.
Open Source Code | Yes | We have open-sourced the code of ADAM at https://github.com/OpenCausaLab/ADAM.
Open Datasets | No | The paper describes the creation of an MC-QA dataset in Appendix B, stating: "We utilize the crafting recipes in Minecraft (version 1.19) to create the MC-QA dataset." However, it does not provide any explicit link, DOI, or statement confirming the public availability of this dataset. The paper also mentions Minecraft, but this is the environment, not a dataset in the typical sense.
Dataset Splits | No | The paper mentions experimental procedures like "Each method has three trials for a maximum length of 200 steps" and "N samplings" for data collection within the Minecraft environment. For the internally created MC-QA dataset, it describes its structure ("754 QA pairs") but does not provide specific training/test/validation splits or percentages for this dataset.
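Since the paper reports only the dataset size (754 QA pairs) and no partition, anyone reproducing the work would need to define their own split. A minimal sketch, assuming a hypothetical 80/10/10 ratio and a fixed seed (neither comes from the paper):

```python
import random

def split_qa(pairs, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle and partition QA pairs into train/val/test.

    The 80/10/10 ratio is an illustrative assumption; the paper
    specifies no split for MC-QA.
    """
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

qa_pairs = list(range(754))  # placeholder for the 754 MC-QA items
train, val, test = split_qa(qa_pairs)
print(len(train), len(val), len(test))  # 603 75 76
```

Fixing the seed keeps the partition reproducible across runs, which is exactly the detail the paper leaves unspecified.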
Hardware Specification | No | The paper mentions using specific LLMs like "GPT-4-turbo (gpt-4-0125-preview)" and "LLaVA-v1.5-13B" for inference, but it does not specify any hardware details (e.g., GPU models, CPU types, memory) used to run their own experiments or for the training/inference of these models within their setup.
Software Dependencies | Yes | In our study, we employ Mineflayer (PrismarineJS, 2023a), a JavaScript-based framework providing control APIs for the commercial Minecraft (version 1.19). For visual processing, we utilize prismarine-viewer (PrismarineJS, 2023b), an API for rendering game scenes from the agent's perspective. ADAM and our baselines all use GPT-4-turbo (gpt-4-0125-preview) for LLM inference... For visual description, we utilize LLaVA-v1.5-13B (Liu et al., 2024) in our perception module.
Experiment Setup | Yes | ADAM and our baselines all use GPT-4-turbo (gpt-4-0125-preview) for LLM inference, with the temperature set to 0.3 based on our experiments in Appendix A. Each method has three trials for a maximum length of 200 steps. The success rate is depicted in the parentheses.
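The reported protocol (GPT-4-turbo at temperature 0.3, three trials of at most 200 steps each) can be pinned down as a small configuration sketch. The model name, temperature, trial count, and step cap come from the paper; `run_trial` is a hypothetical stand-in for one Minecraft rollout, not the authors' code:

```python
# Settings reported in the paper.
LLM_CONFIG = {
    "model": "gpt-4-0125-preview",  # GPT-4-turbo
    "temperature": 0.3,             # chosen via the paper's Appendix A study
}
N_TRIALS = 3      # "Each method has three trials"
MAX_STEPS = 200   # "a maximum length of 200 steps"

def run_trial(task, max_steps=MAX_STEPS):
    """Hypothetical placeholder for one rollout: a real implementation
    would step the agent in Minecraft until success or max_steps."""
    return {"success": False, "steps": max_steps}

def evaluate(task):
    """Success rate over N_TRIALS rollouts, matching the reported protocol."""
    results = [run_trial(task) for _ in range(N_TRIALS)]
    return sum(r["success"] for r in results) / N_TRIALS
```

Recording the configuration this explicitly is what would let a third party reproduce the success rates the paper reports in parentheses.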