ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Authors: Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in the simulated environment show that our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency. Project page: https://guanxinglu.github.io/thinkbot/. 4 EXPERIMENTS In this section, we first introduce the experiment setup, including datasets, baseline methods, evaluation metrics, and implementation details. Then we compare our method with state-of-the-art EIF approaches to show its superiority in success rate and efficiency, and conduct an ablation study to verify the effectiveness of the instruction completer and the object localizer.
Researcher Affiliation | Academia | 1Tsinghua Shenzhen International Graduate School, Tsinghua University 2School of Electrical and Electronic Engineering, Nanyang Technological University 3Carnegie Mellon University 4Department of Automation, Tsinghua University {lgx23@mails.,lujiwen@,tang.yansong@sz.}tsinghua.edu.cn
Pseudocode | No | The paper describes methods and architectures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format in the main body. Appendix C provides 'Full Prompt of the Instruction Completer', which defines the input and response format, not pseudocode for the algorithm itself.
Open Source Code | Yes | Project page: https://guanxinglu.github.io/thinkbot/.
Open Datasets | Yes | For the simulation of EIF tasks, we utilize the well-recognized ALFRED benchmark Shridhar et al. (2020) within the AI2-THOR Kolve et al. (2017) virtual environment. The ALFRED benchmark includes 25,743 trajectory-instruction pairs, covering 7 different task types with varying levels of complexity.
Dataset Splits | Yes | The benchmark is divided into five splits: train, test seen, test unseen, valid seen, and valid unseen. The ALFRED benchmark poses significant challenges for EIF agents, as it requires them to ground incoherent natural instructions of different granularity into various household tasks that involve long-horizon reasoning plans.
Hardware Specification | Yes | In Table 6, we analyze the time consumption of each component in the whole system. We benchmark the time consumption on a single NVIDIA RTX 4090 GPU. In line with Min et al. (2022), we query the LLM after every 25 steps or subgoal completion to ensure consistency. Hence, the delay of LLM reasoning only affects steps that require important subgoal decisions (slow thinking), while most steps just involve path planning to achieve the subgoal (fast thinking), resulting in an acceptable average time per step. In summary, the whole system executes at 1.37 Hz on average. For finetuning, we use the AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 1×10⁻⁴ and weight decay of 5×10⁻². Please refer to Wang et al. (2023a) for more training details. The whole finetuning process takes one day on 4 NVIDIA 3090 GPUs, where the batch size on each GPU is set to 4.
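The query cadence quoted above (consult the LLM every 25 steps or on subgoal completion, and fall back to local path planning otherwise) can be sketched as a simple gating predicate. This is a minimal illustration with hypothetical helper names, not the authors' implementation:

```python
# Slow/fast-thinking gate, per the excerpt: the LLM is queried only every
# 25 steps or when a subgoal completes ("slow thinking"); all other steps
# use local path planning ("fast thinking").

QUERY_INTERVAL = 25  # steps between periodic LLM queries, as stated in the paper

def should_query_llm(step: int, subgoal_completed: bool) -> bool:
    """Return True on steps that warrant a fresh LLM reasoning call."""
    return subgoal_completed or (step > 0 and step % QUERY_INTERVAL == 0)

# Example: over 100 steps with one subgoal completing at step 37,
# the LLM is queried at steps 25, 37, 50, 75, and 100.
queries = [s for s in range(1, 101) if should_query_llm(s, subgoal_completed=(s == 37))]
```

Gating the expensive LLM call this way is what keeps the reported average throughput (1.37 Hz) acceptable despite per-query latency.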
Software Dependencies | Yes | The instruction completer adopts the publicly released GPT-3.5 API (GPT-3.5-turbo) as the base model, where we set the generation temperature to 0 for stability enhancement. For prompt design, we leverage emotion prompt Li et al. (2023a) and prompt optimization Yang et al. (2023) in the system message template to further boost the performance of LLMs. For the multimodal object localizer, we employ a truncated ResNet18 Georgakis et al. (2022) for the map encoder. The AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 5×10⁻⁴ and step decay is employed for parameter updates. We utilize the InternImage-XL backbone pretrained on the COCO dataset Lin et al. (2014) with a Cascade Mask R-CNN head implemented Wang et al. (2023a) on MMDetection Chen et al. (2019).
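The temperature-0 GPT-3.5 setup described above can be sketched as a request builder. The payload shape follows the OpenAI chat-completions API; the system and user messages here are illustrative placeholders, not the paper's actual prompts (the full prompt is in the paper's Appendix C):

```python
# Hedged sketch of the instruction completer's decoding setup: the paper
# uses GPT-3.5-turbo with generation temperature 0 for stability, i.e.
# near-deterministic outputs across repeated runs.

def build_completion_request(system_prompt: str, user_prompt: str) -> dict:
    """Assemble a chat-completions request with temperature-0 decoding."""
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0,  # deterministic generations for reproducibility
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_completion_request(
    "You are an instruction completer for embodied agents.",  # placeholder
    "Complete the missing steps for: 'put the mug in the microwave'.",  # placeholder
)
```

Temperature 0 trades response diversity for stability, which matters when the completer's output feeds downstream planning.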
Experiment Setup | Yes | The instruction completer adopts the publicly released GPT-3.5 API (GPT-3.5-turbo) as the base model, where we set the generation temperature to 0 for stability enhancement. For prompt design, we leverage emotion prompt Li et al. (2023a) and prompt optimization Yang et al. (2023) in the system message template to further boost the performance of LLMs. For the multimodal object localizer, we employ a truncated ResNet18 Georgakis et al. (2022) for the map encoder. The AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 5×10⁻⁴ and step decay is employed for parameter updates. For finetuning, we use the AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 1×10⁻⁴ and weight decay of 5×10⁻². Please refer to Wang et al. (2023a) for more training details. The whole finetuning process takes one day on 4 NVIDIA 3090 GPUs, where the batch size on each GPU is set to 4.
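The two optimizer settings quoted above can be summarized in a small sketch. The excerpt does not specify the step-decay factor or interval, so those values below are assumptions for illustration only:

```python
# Optimizer hyperparameters quoted in the excerpt: the object localizer uses
# AdamW at 5e-4 with step decay; finetuning uses AdamW at 1e-4 with weight
# decay 5e-2 (batch size 4 per GPU across 4 GPUs).

LOCALIZER_LR = 5e-4
FINETUNE_LR = 1e-4
FINETUNE_WEIGHT_DECAY = 5e-2
PER_GPU_BATCH, NUM_GPUS = 4, 4
effective_batch = PER_GPU_BATCH * NUM_GPUS  # 16 samples per optimizer step

def step_decay_lr(base_lr: float, epoch: int,
                  drop: float = 0.1, every: int = 10) -> float:
    """Step decay: multiply the LR by `drop` every `every` epochs.
    The drop factor and interval are ASSUMED values, not from the paper."""
    return base_lr * (drop ** (epoch // every))
```

Under these assumed schedule values, the localizer's learning rate would fall from 5e-4 to 5e-5 after 10 epochs; the true schedule would need to be confirmed against the released code.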