Interactive Task Planning with Language Models

Authors: Boyi Li, Philipp Wu, Pieter Abbeel, Jitendra Malik

TMLR 2025

Reproducibility Assessment (Variable / Result / LLM Response)
Variable: Research Type
Result: Experimental
LLM Response: "We verify the robustness of our system on the real world task of making milk tea drinks. Our system is able to generate novel high-level instructions for unseen objectives and successfully accomplishes user tasks. Furthermore, when the user sends a new request, our system is able to replan accordingly with precision based on the new request, task guidelines and previously executed steps." Supporting evidence: Table 1 (quantitative results with real robots for high-level planning rate and success rate with various user requests); Table 2 (replanning performance with real robots given human-in-the-loop feedback); Table 3 (comparison of executability (Exec) in simulation (VirtualHome) with ProgPrompt).
Variable: Researcher Affiliation
Result: Academia
LLM Response: Boyi Li* (EMAIL, UC Berkeley); Philipp Wu* (EMAIL, UC Berkeley); Pieter Abbeel (EMAIL, UC Berkeley); Jitendra Malik (EMAIL, UC Berkeley)
Variable: Pseudocode
Result: No
LLM Response: The paper describes the system architecture and its modules in detail, including diagrams in Figure 1 and Figure 3. However, it does not contain a section or block labeled "Pseudocode" or "Algorithm" with structured, code-like steps for any procedure.
Variable: Open Source Code
Result: No
LLM Response: "We hope our framework will be useful for accomplishing a wide range of interactive robot tasks and will release our codebase to foster advancements in this field. We aim for our open-source system to inspire more research into using both established and emerging models to enhance real-world robotics."
Variable: Open Datasets
Result: Yes
LLM Response: "We compare our high-level planning module to that of ProgPrompt (Singh et al., 2023) by leveraging the simulated VirtualHome (VH) Environment (Puig et al., 2018)."
Variable: Dataset Splits
Result: No
LLM Response: The paper states that for simulation tasks, "each result is averaged over 5 runs in a single VH Environment across 10 different tasks." This describes how runs were conducted, but it does not specify explicit training, validation, or test splits (as percentages or sample counts) for any dataset used in the experiments. The real-world experiments use task guidelines with few-shot examples, but no formal dataset splits are described.
Variable: Hardware Specification
Result: No
LLM Response: The paper mentions a robot, an overhead camera, "GPT-4 (OpenAI, 2023) as the language model backbone", and a "pretrained VLM: Grounded-DINO (Liu et al., 2023a)". However, it does not provide specific hardware details such as exact GPU models, CPU types, or memory specifications for the experimental setup or the robot itself.
Variable: Software Dependencies
Result: No
LLM Response: The paper references GPT-4 (OpenAI, 2023), Grounded-DINO (Liu et al., 2023a), the DINO model (Caron et al., 2021), and Llama (Dubey et al., 2024) as key models. While these are specific tools, the paper does not provide version numbers for the underlying software libraries, programming languages (e.g., Python 3.8), or solvers typically required for replication.
Variable: Experiment Setup
Result: No
LLM Response: The paper describes the system's interactive nature and prompt formats (e.g., in Figure 3 and Task Guidelines 1) for guiding the LLMs. It explicitly states that "ITP does not require the training of additional value functions" and is a "training-free robotic system". Accordingly, it does not provide hyperparameters, optimizer settings, or model training configurations, since the LLMs and VLMs used are pre-trained and are neither fine-tuned nor trained by the authors for these experiments.