Near, far: Patch-ordering enhances vision foundation models' scene understanding

Authors: Valentinos Pariza, Mohammadreza Salehi, Gertjan J Burghouts, Francesco Locatello, Yuki Asano

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate NeCo's utility by applying it to six different backbones and evaluating it on five datasets and five evaluation protocols, achieving performance gains from 6% to 16%. We set several new state-of-the-art performances, for example on the in-context segmentation benchmark of Balazevic et al. (2023), we outperform previous methods such as CrIBo and DINOv2 on Pascal VOC and ADE20k by 4% to 13% across different metrics. ... 4 EXPERIMENTS ... 4.3 ABLATION STUDIES
Researcher Affiliation | Collaboration | 1 University of Amsterdam, 2 TNO, 3 Institute of Science and Technology Austria
Pseudocode | No | The paper describes the methodology using natural language and mathematical equations, accompanied by a high-level diagram (Figure 1), but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | Due to the unavailability of the original implementation by Balazevic et al. (2023), we use the open implementation from Pariza et al. (2024). This implementation aligns with the original authors' description and details, including the use of the ScaNN library (Guo et al., 2020) for efficient nearest neighbor retrieval. We adhere to the setup from the Hummingbird Model authors (Balazevic et al., 2023) for our experiments.
Open Datasets | Yes | We train our model on ImageNet-100 (Tian et al., 2020), Pascal VOC12 (Everingham et al., 2010), and COCO (Lin et al., 2014) for ablations and use COCO as our primary training dataset for all state-of-the-art comparisons. For evaluations, we use the validation sets of Pascal VOC12 (Everingham et al., 2010), COCO (Lin et al., 2014), ADE20k (Zhou et al., 2017), and Pascal Context (Mottaghi et al., 2014). For finetuning and feature transferability evaluations on COCO (Caesar et al., 2018), we train using a 10% split of the training set, while we use the full training splits of the other datasets. For the 3D understanding benchmark, we use the SPair-71k (Min et al., 2019) dataset.
Dataset Splits | Yes | For evaluations, we use the validation sets of Pascal VOC12 (Everingham et al., 2010), COCO (Lin et al., 2014), ADE20k (Zhou et al., 2017), and Pascal Context (Mottaghi et al., 2014). For finetuning and feature transferability evaluations on COCO (Caesar et al., 2018), we train using a 10% split of the training set, while we use the full training splits of the other datasets. ... The final results are reported as mean Intersection over Union (mIoU) on four different fractions of two datasets: Pascal VOC 2012 (Everingham et al.) and ADE20K (Zhou et al., 2017). The sub-sampling factors are 1, 8, 64, or 128. For factors greater than 1, results are averaged over five different seeds. ... Pascal VOC 2012 (Everingham et al.): this dataset, the latest split version of trainaug, features 10,582 images and their annotations distributed across 21 classes, with one referring to the background class. The validation set consists of 1,449 images. ... The training set comprises 118,000 images, and the validation set contains 5,000 images (COCO). ... with 20,210 images in the training set and 2,000 images in the validation set (ADE20K) ... Pascal Context ... includes 4,998 training images covering 60 semantic classes... The validation set consists of 5,105 images.
Hardware Specification | Yes | We post-pretrain these models for 25 COCO epochs on a single NVIDIA RTX A6000-46GB GPU, taking around 19 hours. ... All experiments are conducted on 8 NVIDIA RTX A6000-46GB GPUs.
Software Dependencies | No | Our model is implemented in Python, using Torch (Paszke et al., 2019) and PyTorch Lightning (Falcon & team, 2019). ... We use the Segmenter implementation available within the MMSegmentation library (MMSegmentation Contributors, 2020). ... We use K-Means (using faiss (Johnson et al., 2019)) ... We use the open implementation from Pariza et al. (2024). This implementation aligns with the original authors' description and details, including the use of the ScaNN library (Guo et al., 2020) for efficient nearest neighbor retrieval. The paper mentions various software libraries and frameworks (Python, Torch, PyTorch Lightning, MMSegmentation, faiss, ScaNN) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We run our experiments on ViT-Small and ViT-Base with a patch size of 14. We post-pretrain these models for 25 COCO epochs... Our data augmentations are the same as in Ziegler & Asano (2022). More specifically, we use: random color-jitter, Gaussian blur, grayscale, and multi-crop augmentations. Similarly, the global crops' resolution is 224x224 and the local crops' resolution is 96x96 for all the experiments, except when working with DINOv2, where we use 518x518 for global crops and 98x98 for local crops. ... We train both network sizes with a cosine learning rate schedule going down to 0 over 25 training epochs, except for the ablation studies where we use 10 epochs. The initial projection head learning rate is 1e-4 for all the experiments, whereas the backbone's learning rate is 1e-5, with the exception of being 1e-6 when applying our method on DINOv2. The exponential moving average for updating the teacher's weights is adapted with a cosine schedule starting at 0.9995 and going up to 1. We use the Adam optimizer (Kingma & Ba, 2017) with a cosine weight decay schedule. ... By default we use the Bitonic Differentiable Sorting Networks (Petersen et al., 2021), and the steepnesses (i.e., inverse temperatures) used for the network are 100 for the student and 100 for the teacher. ... For training the linear head, we downsample the segmentation masks to 100x100 to increase training speed. We use Stochastic Gradient Descent with a weight decay of 0.0001, a momentum of 0.9, and a step learning rate scheduler. We found that a learning rate of 0.01 works quite well for the backbone models we evaluated and our setup. We fine-tune the linear heads for 20 epochs. ... For the ADE20K and COCO-Stuff 164K datasets, we use 160k iterations, and for Pascal VOC 2012 and Pascal Context, we use 80k iterations, all with an eta_min of 0.1·lr.
We use the Adam optimizer (Kingma & Ba, 2017) and, for each pretraining method and dataset, we experiment with four different learning rates (8×10⁻⁵, 3×10⁻⁵, 1×10⁻⁵, 8×10⁻⁶) before reporting the highest mIoU score.
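The Experiment Setup row describes two cosine schedules: a learning rate that decays to 0 over 25 training epochs, and a teacher EMA momentum that rises from 0.9995 to 1. A minimal sketch of both schedules is below; the function names and the per-epoch stepping are illustrative assumptions, not taken from the authors' code (the paper's implementation may step per iteration and add warmup).

```python
import math

def cosine_lr(step, total_steps, base_lr, eta_min=0.0):
    """Cosine decay from base_lr down to eta_min over total_steps."""
    progress = step / total_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))

def cosine_ema_momentum(step, total_steps, start=0.9995, end=1.0):
    """Teacher EMA momentum raised from `start` to `end` on a cosine schedule."""
    progress = step / total_steps
    return end - 0.5 * (end - start) * (1 + math.cos(math.pi * progress))

# Backbone learning rate 1e-5 decaying to 0 over 25 epochs (stepped per epoch here).
print(cosine_lr(0, 25, 1e-5))           # 1e-05 at the start
print(cosine_lr(25, 25, 1e-5))          # 0.0 at the end
print(cosine_ema_momentum(0, 25))       # 0.9995 at the start
print(cosine_ema_momentum(25, 25))      # 1.0 at the end
```

The same `cosine_lr` shape with a nonzero `eta_min` matches the segmentation fine-tuning note above, where the schedule bottoms out at 0.1·lr rather than 0.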