ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Authors: Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments on the challenging open-vocabulary object navigation benchmarks demonstrate the superiority of our proposed system.
Researcher Affiliation | Academia | Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang; School of Automation, Southeast University
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2 for the overall pipeline) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We evaluate the effectiveness and navigation efficiency of our proposed method using the Habitat v3.0 simulator (Puig et al., 2023) on two standard ObjectNav datasets: HM3D (Ramakrishnan et al., 2021) and HSSD (Khanna et al., 2023).
Dataset Splits | Yes | The HM3D dataset offers high-fidelity reconstructions of 20 entire buildings, including 80 training scenes and 20 validation scenes. The HSSD dataset provides 40 high-quality synthetic scenes, comprising 110 training scenes and 40 validation scenes.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or cloud computing instance types) used for running the experiments.
Software Dependencies | No | The paper mentions the Habitat v3.0 simulator and models such as GPT-4o-mini, but does not specify version numbers for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch), or other ancillary software components used for implementation.
Experiment Setup | Yes | Each episode has a maximum limit of 500 steps. The Move Ahead action moves the agent forward by 0.25 m, while the rotational actions Turn Left and Turn Right rotate the agent by 30 degrees. The task is considered successful if the agent reaches the target object with a geodesic distance smaller than a defined threshold (e.g., 1 m) and executes the Stop command within a fixed number of steps. For the data collection of the Where2Imagine module, human demonstration trajectories were taken from the MP3D (Chang et al., 2017) dataset within the habitat-web project, with a camera height of 0.88 m and a horizontal field of view (HFOV) of 79°. The Where2Imagine model with T=11, using a ResNet-18 trained from scratch and GPT-4o-mini as the VLM, was evaluated over 200 epochs on the HM3D and HSSD datasets.
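The episode parameters reported above can be collected into a small configuration sketch. This is a hypothetical illustration, not the authors' code: the names `EpisodeConfig` and `is_success` are assumptions, and only the numeric values (500 steps, 0.25 m forward step, 30° turns, 1 m success threshold, 0.88 m camera height, 79° HFOV) come from the paper's reported setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeConfig:
    """Episode parameters as reported in the paper's experiment setup."""
    max_steps: int = 500           # maximum steps per episode
    forward_step_m: float = 0.25   # Move Ahead displacement
    turn_angle_deg: float = 30.0   # Turn Left / Turn Right increment
    success_dist_m: float = 1.0    # geodesic success threshold (e.g., 1 m)
    camera_height_m: float = 0.88  # camera height for Where2Imagine data
    hfov_deg: float = 79.0         # horizontal field of view


def is_success(geodesic_dist_m: float, steps_taken: int, stop_called: bool,
               cfg: EpisodeConfig = EpisodeConfig()) -> bool:
    """An episode counts as successful only if the agent executes Stop
    within the step budget while its geodesic distance to the target
    is below the threshold."""
    return (stop_called
            and steps_taken <= cfg.max_steps
            and geodesic_dist_m < cfg.success_dist_m)
```

For example, stopping at 0.8 m after 120 steps would count as a success, while stopping at 1.2 m, or never calling Stop, would not.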