Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition
Authors: Changwei Wang, Shunpeng Chen, Yukun Song, Rongtao Xu, Zherui Zhang, Jiguang Zhang, Haoran Yang, Yu Zhang, Kexue Fu, Shide Du, Zhiwei Xu, Longxiang Gao, Li Guo, Shibiao Xu
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. ... Our experiments utilize several VPR benchmark datasets to assess the performance of our models, focusing primarily on Tokyo24/7, Pitts250k, and MSLS, with additional evaluations on Nordland, AmsterTime, SVOX, SPED, and SF-XL. Quantitative results are presented in Table 1. ... Table 4 reports the results of our proposed FoL ablation experiments on the MSLS Challenge benchmark. |
| Researcher Affiliation | Academia | 1 Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences); 2 School of Artificial Intelligence, Beijing University of Posts and Telecommunications; 3 Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science; 4 MAIS, Institute of Automation, Chinese Academy of Sciences; 5 Tongji University; 6 College of Computer and Data Science, Fuzhou University; 7 Shandong University |
| Pseudocode | Yes | Algorithm 1: Pseudo-correspondence Ground Truth Construction (a hedged sketch of one possible construction follows the table) |
| Open Source Code | Yes | Code: https://github.com/chenshunpeng/FoL |
| Open Datasets | Yes | Our experiments utilize several VPR benchmark datasets to assess the performance of our models, focusing primarily on Tokyo24/7, Pitts250k, and MSLS, with additional evaluations on Nordland, AmsterTime, SVOX, SPED, and SF-XL. Tokyo24/7 (Torii et al. 2015) ... Pitts250k (Torii et al. 2013) ... MSLS (Mapillary Street-Level Sequences) (Warburg et al. 2020) ... Nordland (Sünderhauf, Neubert, and Protzel 2013) ... AmsterTime (Yildiz et al. 2022) ... SVOX (Berton et al. 2021) ... The SPED (Zaffar et al. 2021) and SF-XL (Berton, Masone, and Caputo 2022b) datasets ... We trained with one A100 GPU on GSV-Cities (Ali-bey et al. 2022). |
| Dataset Splits | Yes | Our experiments utilize several VPR benchmark datasets to assess the performance of our models, focusing primarily on Tokyo24/7, Pitts250k, and MSLS, with additional evaluations on Nordland, AmsterTime, SVOX, SPED, and SF-XL. We follow the common evaluation metrics used in previous works (Ali-bey, Chaib-draa, and Giguère 2024). We assess performance using Recall@N, with a 25-meter threshold for Tokyo24/7, Pitts30k, and MSLS, and 10 frames for Nordland, effectively measuring retrieval accuracy under various conditions. We continuously monitored recall performance on the MSLS validation set. We follow mainstream works to use 25 meters as the threshold for a correct scene and report recall@k (k=1,5,10) as evaluation metrics. (A Recall@k sketch follows the table.) |
| Hardware Specification | Yes | We trained with one A100 GPU on GSV-Cities (Ali-bey et al. 2022), a large city location dataset collected by Google Street View. |
| Software Dependencies | No | We initialize the ViT-L backbone with pre-trained DINOv2 weights and fine-tune only the last four layers of the backbone. The AdamW optimizer with a linear learning rate schedule was used, with a learning rate of 6e-5 and a weight decay of 9.5e-9. |
| Experiment Setup | Yes | We initialize the ViT-L backbone with pre-trained DINOv2 weights and fine-tune only the last four layers of the backbone. The remaining modules are set to learnable. In the feature extraction stage, the number of clusters M is set to 64. In the re-ranking stage, the up-conv channel counts are 256 and 128, and the convolution kernel size is 3×3 with stride=2 and padding=1. To speed up training, we used 322×322 images but evaluated at 504×504 resolution. The batch size is 60 and each place is described by 4 images. The AdamW optimizer with a linear learning rate schedule was used, with a learning rate of 6e-5 and a weight decay of 9.5e-9. The training converged after 5 epochs. To ensure the validity of our experiments and optimize hyperparameter selection, we continuously monitored recall performance on the MSLS validation set. We follow mainstream works to use 25 meters as the threshold for a correct scene and report recall@k (k=1,5,10) as evaluation metrics. Parameters: thr1 = 0.8, thr2 = 0.5, N = 8 (from Algorithm 1). (A training-setup sketch follows the table.) |
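
The Pseudocode row references Algorithm 1 (Pseudo-correspondence Ground Truth Construction), with thr1 = 0.8, thr2 = 0.5, and N = 8 quoted in the experiment setup, but the algorithm's steps are not reproduced in the extracted text. Below is a minimal, hypothetical sketch of one common way to build pseudo-correspondences between dense descriptors of a matched image pair: mutual nearest neighbours gated by similarity thresholds. The function name and the roles assigned to `thr1`, `thr2`, and `top_n` are assumptions, not the authors' algorithm.

```python
import torch
import torch.nn.functional as F

def pseudo_correspondences(feat_a, feat_b, thr1=0.8, thr2=0.5, top_n=8):
    """Hypothetical sketch: mutual-nearest-neighbour pseudo-correspondences.

    feat_a, feat_b: (num_patches, dim) dense descriptors of a matched image
    pair (feat_b needs at least 2 rows for the second-best test below).
    Returns index pairs (i, j) treated as pseudo ground truth. The thresholds
    and top_n mirror thr1/thr2/N from the paper, but their exact roles in
    Algorithm 1 are assumed here.
    """
    # Cosine-similarity matrix between the two sets of patch descriptors.
    sim = F.normalize(feat_a, dim=1) @ F.normalize(feat_b, dim=1).T

    # Mutual nearest neighbours: i's best match is j and j's best match is i.
    best_ab = sim.argmax(dim=1)            # best j for each i
    best_ba = sim.argmax(dim=0)            # best i for each j
    i_idx = torch.arange(sim.size(0))
    mutual = best_ba[best_ab] == i_idx

    # Keep confident matches (assumed role of thr1) ...
    conf = sim[i_idx, best_ab]
    keep = mutual & (conf >= thr1)

    # ... whose second-best score is sufficiently lower, a ratio-style test
    # (assumed role of thr2).
    second = sim.topk(2, dim=1).values[:, 1]
    keep &= (second / conf.clamp(min=1e-8)) <= thr2

    # Return at most top_n pairs, strongest first (assumed role of N).
    pairs = torch.stack([i_idx[keep], best_ab[keep]], dim=1)
    order = conf[keep].argsort(descending=True)[:top_n]
    return pairs[order]
```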
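
The evaluation protocol above reports Recall@k (k=1,5,10) with a 25-meter threshold. The following is a minimal sketch of that metric, assuming retrieval indices are already ranked by descriptor similarity and that query/database positions are UTM coordinates in meters; the function name and signature are illustrative, not the authors' evaluation code.

```python
import numpy as np

def recall_at_k(preds, query_coords, db_coords, ks=(1, 5, 10), threshold_m=25.0):
    """Fraction of queries with at least one retrieved database image
    within `threshold_m` meters among the top-k candidates.

    preds: (num_queries, max_k) database indices, ranked by similarity.
    query_coords, db_coords: (n, 2) UTM coordinates in meters.
    """
    hits = np.zeros(len(ks))
    for q, ranked in enumerate(preds):
        # Geographic distance from the query to each retrieved candidate.
        dists = np.linalg.norm(db_coords[ranked] - query_coords[q], axis=1)
        for i, k in enumerate(ks):
            if (dists[:k] <= threshold_m).any():
                hits[i] += 1
    return {f"R@{k}": h / len(preds) for k, h in zip(ks, hits)}
```

For Nordland the same logic applies with a 10-frame index difference in place of the metric distance, per the quoted protocol.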
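
The experiment setup quotes a ViT-L backbone initialized from DINOv2 with only the last four blocks fine-tuned, trained with AdamW (lr 6e-5, weight decay 9.5e-9) under a linear learning-rate schedule for 5 epochs. A minimal PyTorch sketch of that configuration follows; the torch.hub entry point for DINOv2 and the endpoints of the linear schedule are assumptions, since the paper excerpt does not state them.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

# DINOv2 ViT-L backbone via torch.hub (the paper states DINOv2 weights but
# not the loading mechanism, so this entry point is assumed).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# Freeze the whole backbone, then unfreeze only the last four transformer
# blocks, matching "fine-tune only the last four layers of the backbone".
for p in backbone.parameters():
    p.requires_grad = False
for block in backbone.blocks[-4:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = AdamW(trainable, lr=6e-5, weight_decay=9.5e-9)

# Linear schedule over the 5 training epochs; the paper says "linear
# learning rate schedule" without endpoints, so the factors are assumed.
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=5)
```

Note that the unusually small weight decay (9.5e-9) is quoted as-is from the paper, and that the re-ranking modules described as learnable would also be added to the optimizer in a full training loop.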