Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art methods by a large margin in all cases. |
| Researcher Affiliation | Collaboration | 1Shanghai AI Laboratory; 2UNC Chapel Hill; 3Adobe Research; 4Nanjing University; 5Shanghai Innovation Institute |
| Pseudocode | Yes | To summarize, the pseudocodes of the SRDF are detailed in Appendix Alg. 1. Algorithm 1 Pipeline of Self-Refining Data Flywheel (SRDF) |
| Open Source Code | Yes | Code and data are available at https://github.com/wz0919/VLN-SRDF. |
| Open Datasets | Yes | We build our flywheel upon the R2R dataset (Anderson et al., 2018b) as D_Seed, containing 14,039 human-annotated training data, along with the 178,270 and 2,890,267 unlabelled trajectories from MP3D (Chang et al., 2017) and HM3D (Ramakrishnan et al., 2021) environments, respectively, as D_Traj. We run the flywheel for three rounds to create the final dataset, named SRDF-20M. |
| Dataset Splits | Yes | Each dataset is split into training, val_seen, and val_unseen sets, with R2R, CVDN, REVERIE, SOON, and R2R-CE also containing test splits. The statistics for the training splits are summarized in Table 2 (manually-labeled datasets), and further details can be found in the Appendix. |
| Hardware Specification | Yes | In our data flywheel, we pre-train the DUET navigator from scratch for 45,000 iterations using a batch size of 1024 and a learning rate of 5 × 10⁻⁵ on 8 NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and architectures like "SigLIP vision encoder", "LLaMA-3 language model backbone", "Mantis-8B-siglip-llama3", but does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | In our data flywheel, we pre-train the DUET navigator from scratch for 45,000 iterations using a batch size of 1024 and a learning rate of 5 × 10⁻⁵ on 8 NVIDIA Tesla A100 GPUs. Multiple checkpoints are fine-tuned to select the best pre-training model. The selected model is then fine-tuned for 6K iterations with a batch size of 16 on a single GPU using only the R2R dataset. |
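The reported training setup can be captured as a small configuration sketch. This is a minimal illustration of the hyperparameters quoted above; the dictionary keys and the `effective_samples` helper are hypothetical names for exposition, not identifiers from the SRDF codebase.

```python
# Hypothetical config sketch of the reported SRDF training hyperparameters.
# Key names are illustrative only; they do not come from the released code.
PRETRAIN_CONFIG = {
    "iterations": 45_000,      # pre-training steps from scratch
    "batch_size": 1024,
    "learning_rate": 5e-5,
    "num_gpus": 8,             # NVIDIA Tesla A100
}

FINETUNE_CONFIG = {
    "iterations": 6_000,       # fine-tuning on R2R only
    "batch_size": 16,
    "num_gpus": 1,
    "dataset": "R2R",
}

def effective_samples(cfg: dict) -> int:
    """Total training samples seen = iterations x batch size."""
    return cfg["iterations"] * cfg["batch_size"]

print(effective_samples(PRETRAIN_CONFIG))   # 46,080,000 samples in pre-training
print(effective_samples(FINETUNE_CONFIG))   # 96,000 samples in fine-tuning
```

A back-of-envelope check like this is useful when comparing the scale of the SRDF-20M pre-training corpus against the much smaller R2R-only fine-tuning stage.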