FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

Authors: Yunzhe Xu, Yiyuan Pan, Zhe Liu, Hesheng Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion on the Touchdown dataset.
Researcher Affiliation | Academia | Yunzhe Xu1, Yiyuan Pan2, Zhe Liu1*, Hesheng Wang2; 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; 2Department of Automation, Shanghai Jiao Tong University, China
Pseudocode | No | The paper describes methods and architectures through figures and text, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/xyz9911/FLAME
Open Datasets | Yes | We evaluate our approach on two urban Vision-and-Language Navigation (VLN) datasets: Touchdown (Chen et al. 2019) and Map2seq (Schumann and Riezler 2021), both set in the StreetLearn environment (Mirowski et al. 2018).
Dataset Splits | Yes | The augmented dataset for the first two training phases contains 2,354 and 4,674 instances, respectively. Our agent is benchmarked against others on the original datasets. We collect 6,518 (out of 9,326) and 6,291 (out of 7,672) pairs grounded with rationales at key locations using GPT-4 for Touchdown and Map2seq, respectively. We formulate synthetic pairs to create a dataset for evaluating reasoning capabilities. The reasoning performance is evaluated exclusively on the subset containing synthetic rationales to ensure a fair comparison. (Tables also show 'Dev Set' and 'Test Set' splits.)
Hardware Specification | Yes | Our agent achieves remarkable computational efficiency, completing the entire training process in just 14 hours on a single A100 GPU.
Software Dependencies | No | Our agent is built upon Otter and OpenFlamingo (Li et al. 2023a; Awadalla et al. 2023), integrating CLIP (Radford et al. 2021) and LLaMA (Touvron et al. 2023). While specific tools and frameworks are mentioned, their version numbers are not provided.
Experiment Setup | Yes | In the Touchdown task, we randomize the agent's heading at the start of each trajectory by selecting one of the possible directions based on its neighboring nodes. Training for the first two phases takes 1 hour each, while the navigation fine-tuning requires 12 hours on a single A100 GPU. The results are presented in Table 2. Rationale Coherence (RC) and Rationale-Action Alignment (RA) consistently remained above 80% and 95%, respectively, indicating robust rationale generation and strong consistency. Higher temperatures led to performance fluctuations, especially with fewer decoding paths. However, when we increase the number of decoding paths to 8, we see more pronounced improvements in both TC and RC.