ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Authors: Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in the simulated environment show that our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency. Project page: https://guanxinglu.github.io/thinkbot/. 4 EXPERIMENTS In this section, we first introduce the experiment setup, including datasets, baseline methods, evaluation metrics, and implementation details. Then we compare our method with state-of-the-art EIF approaches to show its superiority in success rate and efficiency, and conduct an ablation study to verify the effectiveness of the instruction completer and the object localizer.
Researcher Affiliation | Academia | 1Tsinghua Shenzhen International Graduate School, Tsinghua University 2School of Electrical and Electronic Engineering, Nanyang Technological University 3Carnegie Mellon University 4Department of Automation, Tsinghua University {lgx23@mails.,lujiwen@,tang.yansong@sz.}tsinghua.edu.cn
Pseudocode | No | The paper describes methods and architectures but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format in the main body. Appendix C provides 'Full Prompt of the Instruction Completer', which defines the input and response format, not pseudocode for the algorithm itself.
Open Source Code | Yes | Project page: https://guanxinglu.github.io/thinkbot/.
Open Datasets | Yes | For the simulation of EIF tasks, we utilize the well-recognized ALFRED benchmark Shridhar et al. (2020) within the AI2-THOR Kolve et al. (2017) virtual environment. The ALFRED benchmark includes 25,743 trajectory-instruction pairs, covering 7 different task types with varying levels of complexity.
Dataset Splits | Yes | The benchmark is divided into five splits: train, test seen, test unseen, valid seen, and valid unseen. The ALFRED benchmark poses significant challenges for EIF agents, as it requires them to ground incoherent natural instructions of different granularity into various household tasks that involve long-horizon reasoning plans.
Hardware Specification | Yes | In Table 6, we analyze the time consumption of each component in the whole system. We benchmark the time consumption on a single NVIDIA RTX 4090 GPU. In line with Min et al. (2022), we query the LLM after every 25 steps or subgoal completion to ensure consistency. Hence, the delay of LLM reasoning only affects steps that require important subgoal decisions (slow thinking), while most steps just involve path planning to achieve the subgoal (fast thinking), resulting in an acceptable average time per step. In summary, the whole system executes at 1.37 Hz on average. For finetuning, we use the AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 1×10⁻⁴ and weight decay of 5×10⁻². Please refer to Wang et al. (2023a) for more training details. The whole finetuning process takes one day on 4 NVIDIA 3090 GPUs, where the batch size on each GPU is set to 4.
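The query cadence quoted above (consult the LLM every 25 steps or on subgoal completion, and fall back to local path planning otherwise) can be sketched as a simple gating predicate. This is a minimal illustration with hypothetical helper names, not the authors' implementation:

```python
# Slow/fast-thinking gate, per the excerpt: the LLM is queried only every
# 25 steps or when a subgoal completes ("slow thinking"); all other steps
# use local path planning ("fast thinking").

QUERY_INTERVAL = 25  # steps between periodic LLM queries, as stated in the paper

def should_query_llm(step: int, subgoal_completed: bool) -> bool:
    """Return True on steps that warrant a fresh LLM reasoning call."""
    return subgoal_completed or (step > 0 and step % QUERY_INTERVAL == 0)

# Example: over 100 steps with one subgoal completing at step 37,
# the LLM is queried at steps 25, 37, 50, 75, and 100.
queries = [s for s in range(1, 101) if should_query_llm(s, subgoal_completed=(s == 37))]
```

Gating the expensive LLM call this way is what keeps the reported average throughput (1.37 Hz) acceptable despite per-query latency.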
Software Dependencies | Yes | The instruction completer adopts the publicly released GPT-3.5 API (GPT-3.5-turbo) as the base model, where we set the generation temperature to 0 for stability enhancement. For prompt design, we leverage emotion prompt Li et al. (2023a) and prompt optimization Yang et al. (2023) in the system message template to further boost the performance of LLMs. For the multimodal object localizer, we employ a truncated ResNet18 Georgakis et al. (2022) for the map encoder. The AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 5×10⁻⁴ and step decay is employed for parameter updates. We utilize the InternImage-XL backbone pretrained on the COCO dataset Lin et al. (2014) with a Cascade Mask R-CNN head implemented Wang et al. (2023a) on MMDetection Chen et al. (2019).
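The temperature-0 GPT-3.5 setup described above can be sketched as a request builder. The payload shape follows the OpenAI chat-completions API; the system and user messages here are illustrative placeholders, not the paper's actual prompts (the full prompt is in the paper's Appendix C):

```python
# Hedged sketch of the instruction completer's decoding setup: the paper
# uses GPT-3.5-turbo with generation temperature 0 for stability, i.e.
# near-deterministic outputs across repeated runs.

def build_completion_request(system_prompt: str, user_prompt: str) -> dict:
    """Assemble a chat-completions request with temperature-0 decoding."""
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0,  # deterministic generations for reproducibility
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }

request = build_completion_request(
    "You are an instruction completer for embodied agents.",  # placeholder
    "Complete the missing steps for: 'put the mug in the microwave'.",  # placeholder
)
```

Temperature 0 trades response diversity for stability, which matters when the completer's output feeds downstream planning.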
Experiment Setup | Yes | The instruction completer adopts the publicly released GPT-3.5 API (GPT-3.5-turbo) as the base model, where we set the generation temperature to 0 for stability enhancement. For prompt design, we leverage emotion prompt Li et al. (2023a) and prompt optimization Yang et al. (2023) in the system message template to further boost the performance of LLMs. For the multimodal object localizer, we employ a truncated ResNet18 Georgakis et al. (2022) for the map encoder. The AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 5×10⁻⁴ and step decay is employed for parameter updates. For finetuning, we use the AdamW optimizer Loshchilov & Hutter (2017) with an initial learning rate of 1×10⁻⁴ and weight decay of 5×10⁻². Please refer to Wang et al. (2023a) for more training details. The whole finetuning process takes one day on 4 NVIDIA 3090 GPUs, where the batch size on each GPU is set to 4.
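The two optimizer settings quoted above can be summarized in a small sketch. The excerpt does not specify the step-decay factor or interval, so those values below are assumptions for illustration only:

```python
# Optimizer hyperparameters quoted in the excerpt: the object localizer uses
# AdamW at 5e-4 with step decay; finetuning uses AdamW at 1e-4 with weight
# decay 5e-2 (batch size 4 per GPU across 4 GPUs).

LOCALIZER_LR = 5e-4
FINETUNE_LR = 1e-4
FINETUNE_WEIGHT_DECAY = 5e-2
PER_GPU_BATCH, NUM_GPUS = 4, 4
effective_batch = PER_GPU_BATCH * NUM_GPUS  # 16 samples per optimizer step

def step_decay_lr(base_lr: float, epoch: int,
                  drop: float = 0.1, every: int = 10) -> float:
    """Step decay: multiply the LR by `drop` every `every` epochs.
    The drop factor and interval are ASSUMED values, not from the paper."""
    return base_lr * (drop ** (epoch // every))
```

Under these assumed schedule values, the localizer's learning rate would fall from 5e-4 to 5e-5 after 10 epochs; the true schedule would need to be confirmed against the released code.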