EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition
Authors: Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive ablation study using two different datasets to analyze the key parameters of our EffoVPR method. Key findings are summarized below, with detailed results provided in the appendix. Analyzing the different ViT self-attention facets, we show that the Value facet (V) offers the most effective local features for VPR re-ranking (Table S3). Moreover, the selection of the layer for feature extraction is important, with the n−1 layer delivering the best performance, whereas using the final layer n leads to a decline in results (Table S1). In Table S4 we show that our re-ranking stage remains effective even with as few as five candidates, achieving SoTA results, highlighting the method's efficiency and effectiveness. |
| Researcher Affiliation | Collaboration | Issar Tzachor¹, Boaz Lerner¹, Matan Levy², Michael Green¹, Tal B. Shalev¹, Gavriel Habib¹, Dvir Samuel¹, Noam K. Zailer¹, Or Shimshi¹, Nir Darshan¹, Rami Ben-Ari¹. ¹OriginAI, Israel; ²The Hebrew University of Jerusalem, Israel. EMAIL |
| Pseudocode | No | The paper describes methods and processes through narrative text and mathematical equations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | No | The paper does not state that code for EffoVPR itself is released; it only notes reliance on third-party tooling: "Following prior work, we have used (Berton et al., 2022c) open-source code for downloading and organizing datasets, to ensure maximum reproducibility." |
| Open Datasets | Yes | We evaluate EffoVPR on a large number of diverse datasets (20), including e.g. Pitts30k (Arandjelovic et al., 2016), Tokyo24/7 (Torii et al., 2015), MSLS-val/challenge (Warburg et al., 2020), Nordland (Sünderhauf et al., 2013), and more, exhibiting a wide variety of conditions, including different cities, day/night images, and seasonal changes. Note that MSLS Challenge (Warburg et al., 2020) is a hold-out set whose labels are not released, but researchers submit their predictions to the challenge server. |
| Dataset Splits | Yes | Mapillary Street-Level Sequences (MSLS) (Warburg et al., 2020) is an image- and sequence-based VPR dataset. The dataset consists of more than 1.6M geo-tagged images collected over more than seven years from 30 cities, in urban, suburban, and natural environments. There are three non-overlapping subsets: a training set, a validation set (MSLS-val), and a withheld test set (MSLS-challenge). MSLS-val and MSLS-challenge present various challenges, including viewpoint variations, long-term changes, and illumination and seasonal changes. (...) Pittsburgh30k (Arandjelovic et al., 2016) is collected from Google Street View 360° panoramas of downtown Pittsburgh, split into multiple images. Ensuring that queries and gallery were taken in different years, it provides three splits: a training set, validation, and test. Pitts30k-test consists of 10k gallery images and 6,816 queries. |
| Hardware Specification | Yes | We train EffoVPR with a batch size of 16, for 25 epochs, on a single NVIDIA A100 node. (...) Table S10: Local features dimension, memory footprint, latency and performance for different methods. Different methods utilized different GPUs for runtime evaluation: R2Former runtime was measured using an RTX A5000, SelaVPR with an RTX 3090, and EffoVPR with an A100. |
| Software Dependencies | No | The paper mentions using ViT-L/14 as the backbone, initialized with pre-trained weights of DINOv2, and the optimizers AdamW and Adam. However, specific version numbers for these software components (e.g., PyTorch version, DINOv2 release version, CUDA version) are not provided. |
| Experiment Setup | Yes | We set an AdamW optimizer for the backbone, and an Adam optimizer for the classification heads, both with a constant learning rate of 1×10⁻⁵. We train EffoVPR with a batch size of 16, for 25 epochs (...) To retain the rich visual representations learned during pre-training, while adapting the model for the VPR task, we fine-tune only the final layers of our backbone. |
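The ablation rows above describe a two-stage pipeline: global descriptors retrieve a short candidate list (effective with as few as five candidates), which is then re-ranked using local features taken from a ViT self-attention facet. The sketch below is a minimal, hedged illustration of that pattern with NumPy; the array shapes, the mutual nearest-neighbor match counting, and `top_k=5` are assumptions for illustration, not the paper's exact re-ranking rule.

```python
import numpy as np

def retrieve_and_rerank(query_global, gallery_global,
                        query_local, gallery_local, top_k=5):
    """Two-stage retrieval sketch: global cosine ranking, then
    re-ranking by counting mutual nearest-neighbor local matches."""
    # Stage 1: rank the gallery by cosine similarity of global descriptors.
    qg = query_global / np.linalg.norm(query_global)
    gg = gallery_global / np.linalg.norm(gallery_global, axis=1, keepdims=True)
    sims = gg @ qg
    candidates = np.argsort(-sims)[:top_k]

    # Stage 2: re-rank candidates by mutual nearest-neighbor matches
    # between L2-normalized local features (e.g., value-facet tokens).
    def mutual_matches(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        s = a @ b.T
        nn_ab = s.argmax(axis=1)   # best gallery token for each query token
        nn_ba = s.argmax(axis=0)   # best query token for each gallery token
        return int(sum(nn_ba[j] == i for i, j in enumerate(nn_ab)))

    scores = [mutual_matches(query_local, gallery_local[c]) for c in candidates]
    order = np.argsort(-np.asarray(scores), kind="stable")
    return candidates[order]
```

A candidate whose local tokens agree with the query's (many mutual matches) moves to the front even if its global similarity was slightly lower, which is what makes the second stage useful on small candidate lists.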
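The Experiment Setup row states that only the final backbone layers are fine-tuned, with AdamW on the backbone and Adam on the classification heads (both at lr 1×10⁻⁵). A minimal sketch of the parameter partitioning behind such a setup is below; the `blocks.<i>.` / `head.` naming scheme and the choice of 4 trainable layers are assumptions for illustration (the paper does not specify parameter names or the exact layer count), and in practice each group would be handed to a PyTorch optimizer.

```python
def split_finetune_params(param_names, total_layers=24, trainable_last=4):
    """Partition parameter names into frozen backbone, trainable
    backbone (final layers only), and classification-head groups.

    Assumed naming: transformer blocks as 'blocks.<i>.*', heads as
    'head.*'; anything else (patch embed, etc.) stays frozen.
    """
    frozen, backbone, heads = [], [], []
    for name in param_names:
        if name.startswith("head."):
            heads.append(name)                  # -> Adam, lr 1e-5
        elif name.startswith("blocks."):
            layer = int(name.split(".")[1])
            if layer >= total_layers - trainable_last:
                backbone.append(name)           # -> AdamW, lr 1e-5
            else:
                frozen.append(name)             # requires_grad = False
        else:
            frozen.append(name)                 # e.g. patch embedding
    return frozen, backbone, heads
```

Splitting by name like this keeps the pre-trained early layers untouched, matching the stated goal of retaining the representations learned during pre-training.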