FloNa: Floor Plan Guided Embodied Visual Navigation
Authors: Jiaxin Li, Weiqi Huang, Zan Wang, Wei Liang, Huijun Di, Feng Liu
AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further collect 20k navigation episodes across 117 scenes in the iGibson simulator to support the training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our method in navigating within unseen environments using a floor plan. |
| Researcher Affiliation | Collaboration | (1) Beijing Institute of Technology, Beijing, China; (2) Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China; (3) Beijing Racobit Electronic Information Technology Co., Ltd. EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods using equations and textual descriptions of processes (e.g., in the 'Diffusion Model' and 'Diffusion Policy' sections) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We recommend referring to our project website for the demonstration video of the planning results. |
| Open Datasets | No | For benchmarking, we collect a dataset comprising approximately 20k navigation episodes across 117 distinct scenes using the iGibson simulator (Li et al. 2021a). The dataset includes around 3.3M images captured with a 45-degree field of view. We split the scenes into 67 for training and 50 for testing to assess the model's generalization capability to unseen environments. Each scene comprises a floor plan, a traversability map, and sufficient navigation episodes. Each episode contains an A*-generated trajectory paired with corresponding RGB observations. |
| Dataset Splits | Yes | We split the scenes into 67 for training and 50 for testing to assess the model's generalization capability to unseen environments. The dataset includes around 3.3M images captured with a 45-degree field of view. We train FloDiff on the training set, which consists of 67 indoor scenes, encompassing 11,575 episodes and approximately 26 hours of trajectory data. |
| Hardware Specification | Yes | We train FloDiff using one NVIDIA RTX 3090 GPU and assign a batch size of 256. Our model achieves an inference rate of approximately 1.88 Hz when running on an NVIDIA Jetson AGX Orin. |
| Software Dependencies | No | In the implementation, FloDiff is trained for 5 epochs using the AdamW (Loshchilov, Hutter et al. 2017) optimizer with a fixed learning rate of 0.0001. The attention layers are built using the native PyTorch implementation. The diffusion policy is trained using the squared cosine noise scheduler (Nichol and Dhariwal 2021) with K = 10 denoising steps. |
| Experiment Setup | Yes | In the implementation, FloDiff is trained for 5 epochs using the AdamW (Loshchilov, Hutter et al. 2017) optimizer with a fixed learning rate of 0.0001. We empirically set λ1 = λ3 = 0.001 and λ2 = 0.005. The attention layers are built using the native PyTorch implementation. The number of multi-head attention layers and heads are both 4. We set the dimension of the observation context vector ct to 256. The diffusion policy is trained using the squared cosine noise scheduler (Nichol and Dhariwal 2021) with K = 10 denoising steps. The noise prediction network ϵθ adopts a conditional U-Net architecture following (Janner et al. 2022) with 15 convolutional layers. We set the diffusion horizon as Hp = 32 and employ the first Ha = 16 steps to execute in each iteration. |
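The squared cosine noise scheduler referenced in the setup above follows the iDDPM formulation of Nichol and Dhariwal (2021). As a minimal sketch of what K = 10 denoising steps implies, the snippet below computes the cumulative ᾱ values and per-step β variances under the standard formulation (offset s = 0.008, β clipped at 0.999); the function names are illustrative, not from the paper's code, which is not released.

```python
import math

def squared_cosine_alpha_bars(K: int, s: float = 0.008) -> list[float]:
    """Cumulative alpha-bar values of the squared-cosine schedule
    (Nichol & Dhariwal 2021), evaluated at K discrete steps."""
    def f(t: float) -> float:
        return math.cos((t / K + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, K + 1)]

def betas_from_alpha_bars(alpha_bars: list[float],
                          max_beta: float = 0.999) -> list[float]:
    """Per-step noise variances beta_t = 1 - alpha_bar_t / alpha_bar_{t-1},
    clipped at max_beta as in the original formulation."""
    betas, prev = [], 1.0
    for ab in alpha_bars:
        betas.append(min(1.0 - ab / prev, max_beta))
        prev = ab
    return betas

K = 10  # denoising steps reported for FloDiff
alpha_bars = squared_cosine_alpha_bars(K)
betas = betas_from_alpha_bars(alpha_bars)
```

With so few steps, each β is comparatively large, which is consistent with the paper's emphasis on fast inference (≈1.88 Hz on a Jetson AGX Orin).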