FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

Authors: Yunzhe Xu, Yiyuan Pan, Zhe Liu, Hesheng Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate FLAME's superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion on the Touchdown dataset.
Researcher Affiliation | Academia | Yunzhe Xu1, Yiyuan Pan2, Zhe Liu1*, Hesheng Wang2; 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; 2Department of Automation, Shanghai Jiao Tong University, China
Pseudocode | No | The paper describes methods and architectures through figures and text, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/xyz9911/FLAME
Open Datasets | Yes | We evaluate our approach on two urban Vision-and-Language Navigation (VLN) datasets: Touchdown (Chen et al. 2019) and Map2seq (Schumann and Riezler 2021), both set in the StreetLearn environment (Mirowski et al. 2018).
Dataset Splits | Yes | The augmented dataset for the first two training phases contains 2,354 and 4,674 instances, respectively. Our agent is benchmarked against others on the original datasets. We collect 6,518 (out of 9,326) and 6,291 (out of 7,672) pairs grounded with rationales at key locations using GPT-4 for Touchdown and Map2seq, respectively. We formulate synthetic pairs to create a dataset for evaluating reasoning capabilities. The reasoning performance is evaluated exclusively on the subset containing synthetic rationales to ensure a fair comparison. (Tables also show 'Dev Set' and 'Test Set' splits.)
Hardware Specification | Yes | Our agent achieves remarkable computational efficiency, completing the entire training process in just 14 hours on a single A100 GPU.
Software Dependencies | No | Our agent is built upon Otter and OpenFlamingo (Li et al. 2023a; Awadalla et al. 2023), integrating CLIP (Radford et al. 2021) and LLaMA (Touvron et al. 2023). While specific tools and frameworks are mentioned, their version numbers are not provided.
Experiment Setup | Yes | In the Touchdown task, we randomize the agent's heading at the start of each trajectory by selecting one of the possible directions based on its neighboring nodes. Training for the first two phases takes 1 hour each, while the navigation fine-tuning requires 12 hours on a single A100 GPU. The results are presented in Table 2. Rationale Coherence (RC) and Rationale-Action Alignment (RA) consistently remained above 80% and 95%, respectively, indicating robust rationale generation and strong consistency. Higher temperatures led to performance fluctuations, especially with fewer decoding paths. However, when we increase the number of decoding paths to 8, we see more pronounced improvements in both TC and RC.