Towards flexible perception with visual memory

Authors: Robert Geirhos, Priyank Jaini, Austin Stone, Sourabh Medapati, Xi Yi, George Toderici, Abhijit Ogale, Jonathon Shlens

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Table 1 shows that with a visual memory it is possible to add new classes such that the in-distribution accuracy is maintained without catastrophic forgetting (the new classes change ImageNet validation performance by only 0.02-0.04%, depending on the aggregation method), while at the same time reaching very high accuracy on the new OOD classes (approx. 87% top-1) without any training. Figure 12 in the appendix confirms that the samples are indeed OOD for the model, as demonstrated by larger distances to nearest neighbors.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Robert Geirhos <EMAIL>, Priyank Jaini <EMAIL>.
Pseudocode | No | The paper describes methods like 'Fast inference using matrix multiplication on GPUs/TPUs' and 'Fast and scalable nearest neighbor search' in prose, but does not present them in a structured pseudocode or algorithm block.
Open Source Code | Yes | Code availability. Code to replicate experiments from this paper is available at https://github.com/google-deepmind/visual-memory.
Open Datasets | Yes | For our experiments, our visual memory comprises features extracted from a dataset like the ImageNet-1K (Russakovsky et al., 2015) training set or JFT (Zhai et al., 2022) using different encoders like DINOv2 (Oquab et al., 2023) and CLIP (Radford et al., 2021). ... We took the new classes from the NINCO dataset (Bitterwolf et al., 2023), a dedicated OOD dataset... We test this using DINOv2 ViT-L14 embeddings on the iNaturalist21 dataset (iNaturalist Team, 2021), a large-scale imbalanced dataset...
Dataset Splits | Yes | For our experiments, our visual memory comprises features extracted from a dataset like the ImageNet-1K (Russakovsky et al., 2015) training set or JFT (Zhai et al., 2022) using different encoders like DINOv2 (Oquab et al., 2023) and CLIP (Radford et al., 2021). ... Table 1: Flexible lifelong learning: adding OOD classes. A visual memory of DINOv2 ViT-L14 with ImageNet-train (IN-train) as memory database is able to handle a simple insert-into-memory operation for 64 out-of-distribution classes... We test this using DINOv2 ViT-L14 embeddings on the iNaturalist21 dataset (iNaturalist Team, 2021)... In a leave-one-out fashion, we simulate the discovery of a new species by putting 50 exemplars for each of the 9,999 species into memory and then iteratively adding more data for the remaining newly discovered species, starting from zero exemplars all the way to 50 exemplars.
Hardware Specification | No | The paper mentions 'GPUs/TPUs' and 'CPUs' for running inference and search, but does not provide specific model numbers or detailed hardware specifications.
Software Dependencies | No | The paper mentions using ScaNN for nearest neighbor search and scikit-learn for CLIP linear probe results, but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Softmax voting: Each neighbour is assigned a weight based on the softmax function, i.e. w_i = softmax(dist(z, z[i]), τ), where τ is the temperature. This voting method is considered state-of-the-art; for example, nearest-neighbor accuracies of self-supervised models are reported using this method. A temperature of τ = 0.07 frequently appears in the literature (Wu et al., 2018; Caron et al., 2021; Oquab et al., 2023) and is reported as a parameter which we do not tune in the DINO paper (Caron et al., 2021, p. 18). ... Rank voting: We propose using a simple aggregation approach wherein each neighbour is assigned a power-function weight based on its rank in the ordered set Neighbors(x), i.e. w_i = 1/(α + rank_i), where rank_i is i and α is an offset to avoid division by zero, set to 2.0. ... For DINOv2, the authors froze the model backbone and trained the linear layers for 12,500 iterations using SGD. Instead of training a single time, they performed a full grid-search sweep over three settings (output layers in {1, 4}; pooling token concatenation in {yes, no}; and 13 different learning rates), resulting in 52 linear probes.
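The "Research Type" row quotes the paper's core claim: new classes can be added by simply inserting their embeddings into the visual memory, with no retraining. A minimal sketch of that idea, not the paper's implementation (the class name, random embeddings, and plurality vote are illustrative assumptions):

```python
import numpy as np

class VisualMemory:
    """Minimal nearest-neighbor visual memory (illustrative sketch).

    Stores (embedding, label) pairs; classification is a k-NN lookup,
    so new classes are added by inserting embeddings -- no retraining,
    hence no catastrophic forgetting of existing classes.
    """

    def __init__(self, dim):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.labels = []

    def insert(self, embeddings, labels):
        # Normalize so the inner product equals cosine similarity.
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.embeddings = np.vstack([self.embeddings, embeddings.astype(np.float32)])
        self.labels.extend(labels)

    def classify(self, query, k=5):
        query = query / np.linalg.norm(query)
        sims = self.embeddings @ query          # cosine similarity to all entries
        top = np.argsort(-sims)[:k]             # indices of the k nearest neighbors
        votes = {}
        for i in top:                           # plurality vote over neighbor labels
            votes[self.labels[i]] = votes.get(self.labels[i], 0) + 1
        return max(votes, key=votes.get)

# Usage: insert a brand-new class after "deployment" and query it.
rng = np.random.default_rng(0)
mem = VisualMemory(dim=8)
mem.insert(rng.normal(size=(10, 8)) + 5.0, ["cat"] * 10)
mem.insert(rng.normal(size=(10, 8)) - 5.0, ["new_species"] * 10)  # no retraining needed
print(mem.classify(rng.normal(size=8) - 5.0))
```

The two toy clusters are well separated, so a query near the newly inserted cluster retrieves the new class while queries near the old cluster are unaffected.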
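The "Pseudocode" row notes that 'Fast inference using matrix multiplication on GPUs/TPUs' is described only in prose. The standard trick behind that phrase is to compute all query-database similarities as one matrix multiplication; a sketch under that assumption (function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def knn_matmul(queries, database, k):
    """Batched k-NN via a single matrix multiplication.

    queries:  (Q, D) L2-normalized query embeddings
    database: (N, D) L2-normalized memory embeddings
    Returns a (Q, k) array of nearest-neighbor indices by cosine similarity.
    The matmul is the accelerator-friendly step on GPUs/TPUs.
    """
    sims = queries @ database.T                               # (Q, N) similarities
    # argpartition finds the top-k in O(N) per query; then sort just those k.
    top_unsorted = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(sims, top_unsorted, axis=1), axis=1)
    return np.take_along_axis(top_unsorted, order, axis=1)

# Usage: 4 slightly perturbed queries against a 1000-item memory.
rng = np.random.default_rng(1)
db = rng.normal(size=(1000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[[3, 42, 7, 900]] + 0.01 * rng.normal(size=(4, 64)).astype(np.float32)
q /= np.linalg.norm(q, axis=1, keepdims=True)
print(knn_matmul(q, db, k=5)[:, 0])  # index of each query's nearest neighbor
```

For large memories, exact matmul search is typically replaced by an approximate index such as ScaNN, which the "Software Dependencies" row mentions.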
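The "Experiment Setup" row quotes two neighbor-aggregation schemes: softmax voting with temperature τ = 0.07 and rank voting with weights 1/(α + rank_i), α = 2.0. A minimal sketch of both, assuming 1-based ranks and cosine similarities as the softmax logits (the exact indexing and distance convention are assumptions, not confirmed by the quoted text):

```python
import numpy as np

def softmax_vote_weights(similarities, tau=0.07):
    """Softmax voting: weight each neighbor by softmax(similarity / tau).

    `similarities` are the query-to-neighbor similarities of the k nearest
    neighbors; tau = 0.07 is the temperature commonly used in the literature.
    """
    logits = similarities / tau
    logits = logits - logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def rank_vote_weights(k, alpha=2.0):
    """Rank voting: w_i = 1 / (alpha + rank_i) for ranks 1..k.

    Depends only on neighbor rank, not on raw distances, so it needs no
    temperature tuning; alpha offsets the denominator away from zero.
    """
    ranks = np.arange(1, k + 1)
    w = 1.0 / (alpha + ranks)
    return w / w.sum()

def aggregate(labels, weights, num_classes):
    """Sum per-neighbor weights into class scores and return the argmax class."""
    scores = np.zeros(num_classes)
    np.add.at(scores, labels, weights)      # scatter-add weights by class label
    return int(scores.argmax())

# Usage: 5 neighbors, labels and similarities in descending similarity order.
labels = np.array([2, 2, 0, 1, 0])
sims = np.array([0.91, 0.88, 0.85, 0.60, 0.55])
print(aggregate(labels, softmax_vote_weights(sims), num_classes=3))  # class 2 wins
print(aggregate(labels, rank_vote_weights(k=5), num_classes=3))      # class 2 wins
```

At τ = 0.07 the softmax is sharply peaked, so the top one or two neighbors dominate; rank voting decays more gently and has no tunable temperature, which is the trade-off the quoted passage highlights.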