FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
Authors: Yunzhe Xu, Yiyuan Pan, Zhe Liu, Hesheng Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion on the Touchdown dataset. |
| Researcher Affiliation | Academia | Yunzhe Xu1, Yiyuan Pan2, Zhe Liu1*, Hesheng Wang2. 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China. 2Department of Automation, Shanghai Jiao Tong University, China. |
| Pseudocode | No | The paper describes methods and architectures through figures and text, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/xyz9911/FLAME |
| Open Datasets | Yes | We evaluate our approach on two urban Vision-and-Language Navigation (VLN) datasets: Touchdown (Chen et al. 2019) and Map2seq (Schumann and Riezler 2021), both set in the StreetLearn environment (Mirowski et al. 2018). |
| Dataset Splits | Yes | The augmented dataset for the first two training phases contains 2,354 and 4,674 instances, respectively. Our agent is benchmarked against others on the original datasets. We collect 6,518 (out of 9,326) and 6,291 (out of 7,672) pairs grounded with rationales at key locations using GPT-4 for Touchdown and Map2seq, respectively. We formulate synthetic pairs to create a dataset for evaluating reasoning capabilities. The reasoning performance is evaluated exclusively on the subset containing synthetic rationales to ensure a fair comparison. (Tables also show 'Dev Set' and 'Test Set' splits.) |
| Hardware Specification | Yes | Our agent achieves remarkable computational efficiency, completing the entire training process in just 14 hours on a single A100 GPU. |
| Software Dependencies | No | Our agent is built upon Otter and OpenFlamingo (Li et al. 2023a; Awadalla et al. 2023), integrating CLIP (Radford et al. 2021) and LLaMA (Touvron et al. 2023). While specific tools and frameworks are mentioned, their version numbers are not provided. |
| Experiment Setup | Yes | In the Touchdown task, we randomize the agent's heading at the start of each trajectory by selecting one of the possible directions based on its neighboring nodes. Training for the first two phases takes 1 hour each, while the navigation fine-tuning requires 12 hours on a single A100 GPU. The results are presented in Table 2. Rationale Coherence (RC) and Rationale-Action Alignment (RA) consistently remained above 80% and 95% respectively, indicating robust rationale generation and strong consistency. Higher temperatures led to performance fluctuations, especially with fewer decoding paths. However, when we increase the number of decoding paths to 8, we see more pronounced improvements in both TC and RC. |
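The heading-randomization step quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code; the graph representation (node IDs mapped to neighbor headings, in the style of a StreetLearn navigation graph) is an assumption.

```python
import random

def randomize_initial_heading(graph, node):
    """Pick a random initial heading toward one of the node's neighbors.

    `graph` is assumed to map node IDs to {neighbor_id: heading_degrees},
    a hypothetical stand-in for a StreetLearn-style navigation graph.
    """
    neighbors = graph[node]
    chosen = random.choice(sorted(neighbors))  # sort for reproducible choice set
    return neighbors[chosen]

# Toy example: a node with three outgoing edges
graph = {"n0": {"n1": 0.0, "n2": 90.0, "n3": 270.0}}
heading = randomize_initial_heading(graph, "n0")
assert heading in {0.0, 90.0, 270.0}
```

The only constraint the paper states is that the sampled heading must correspond to one of the node's actual neighbors, which the assertion above checks.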