Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art methods by a large margin in all cases. |
| Researcher Affiliation | Collaboration | 1Shanghai AI Laboratory; 2UNC Chapel Hill; 3Adobe Research; 4Nanjing University; 5Shanghai Innovation Institute |
| Pseudocode | Yes | To summarize, the pseudocodes of the SRDF are detailed in Appendix Alg. 1. Algorithm 1 Pipeline of Self-Refining Data Flywheel (SRDF) |
| Open Source Code | Yes | Code and data are available at https://github.com/wz0919/VLN-SRDF. |
| Open Datasets | Yes | We build our flywheel upon the R2R dataset (Anderson et al., 2018b) as D_Seed, containing 14,039 human-annotated training data, along with the 178,270 and 2,890,267 unlabelled trajectories from MP3D (Chang et al., 2017) and HM3D (Ramakrishnan et al., 2021) environments, respectively, as D_Traj. We run the flywheel for three rounds to create the final dataset, named SRDF-20M. |
| Dataset Splits | Yes | Each dataset is split into training, val_seen, and val_unseen sets, with R2R, CVDN, REVERIE, SOON, and R2R-CE also containing test splits. The statistics for the training splits are summarized in Table 2 (manually-labeled datasets), and further details can be found in the Appendix. |
| Hardware Specification | Yes | In our data flywheel, we pre-train the DUET navigator from scratch for 45,000 iterations using a batch size of 1024 and a learning rate of 5 × 10⁻⁵ on 8 NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and architectures like "SigLIP vision encoder", "LLaMA-3 language model backbone", "Mantis-8B-siglip-llama3", but does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | In our data flywheel, we pre-train the DUET navigator from scratch for 45,000 iterations using a batch size of 1024 and a learning rate of 5 × 10⁻⁵ on 8 NVIDIA Tesla A100 GPUs. Multiple checkpoints are fine-tuned to select the best pre-training model. The selected model is then fine-tuned for 6K iterations with a batch size of 16 on a single GPU using only the R2R dataset. |
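The reported training setup can be captured as a small configuration sketch. This is a minimal illustration of the hyperparameters quoted above; the dictionary keys and the `effective_samples` helper are hypothetical names for exposition, not identifiers from the SRDF codebase.

```python
# Hypothetical config sketch of the reported SRDF training hyperparameters.
# Key names are illustrative only; they do not come from the released code.
PRETRAIN_CONFIG = {
    "iterations": 45_000,      # pre-training steps from scratch
    "batch_size": 1024,
    "learning_rate": 5e-5,
    "num_gpus": 8,             # NVIDIA Tesla A100
}

FINETUNE_CONFIG = {
    "iterations": 6_000,       # fine-tuning on R2R only
    "batch_size": 16,
    "num_gpus": 1,
    "dataset": "R2R",
}

def effective_samples(cfg: dict) -> int:
    """Total training samples seen = iterations x batch size."""
    return cfg["iterations"] * cfg["batch_size"]

print(effective_samples(PRETRAIN_CONFIG))   # 46,080,000 samples in pre-training
print(effective_samples(FINETUNE_CONFIG))   # 96,000 samples in fine-tuning
```

A back-of-envelope check like this is useful when comparing the scale of the SRDF-20M pre-training corpus against the much smaller R2R-only fine-tuning stage.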