Separating Knowledge and Perception with Procedural Data
Authors: Adrian Rodriguez-Munoz, Manel Baradad, Phillip Isola, Antonio Torralba
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to a model trained on Places, our procedural model performs within 1% on NIGHTS visual similarity, outperforms by 8% and 15% on CUB200 and Flowers102 fine-grained classification, and is within 10% on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an R² on COCO within 10% of models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA. Correspondence to: Adrian Rodríguez-Muñoz <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes like 'Shaders KML Mixup process' (Section 3.1) and illustrates them with diagrams (Figure 7), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | No | The paper mentions 'On the original research code' in section 8, implying the existence of their own code. However, it does not provide any explicit statement about releasing this code, nor does it include any links to a code repository. The references include code from other projects (e.g., 'Gildenblat, J. and contributors. Pytorch library for cam methods.'), but this is not the code for the methodology described in this paper. |
| Open Datasets | Yes | We evaluate the procedural networks on a Human Visual Similarity (HVS) task using the NIGHTS dataset (Fu et al., 2023)... Table 1 shows KNN classification accuracy on various fine-grained datasets and ImageNet-1K (Russakovsky et al., 2015)... Segmentation: Procedural models have remarkable semantic segmentation ability. Figure 9 qualitatively shows how procedural features clearly separate the bike and zebras from their surroundings. Quantitatively, Table 2 shows numerical R² (ratio of explained variance to total variance) between principal component analysis (PCA) features and human labels. The best procedural model is within 10% of real data models and well above random and RGB features. Procedural models are also capable of in-context segmentation: given a prompt image and a prompt mask representing a concept, they can effectively search for it in a new query image, even in the presence of equally colored distractors, as in the second row of Figure 10. However, they struggle at KNN semantic segmentation with a large visual memory, as seen in Figure 11. As explained in Section 3, the DINO objective on real data teaches models to have similar representations for parts of real-world objects, even when the parts are visually dissimilar. In contrast, procedural models, having never seen the object during training, will have dissimilar representations for the parts. We can observe this in Figure 9: the center and spokes of the wheel are colored the same in real models and differently in procedural models. Procedural models have excessively local representations, which are vulnerable to spurious similarities with object parts of other classes. |
| Dataset Splits | Yes | We evaluate the procedural networks on a Human Visual Similarity (HVS) task using the NIGHTS dataset (Fu et al., 2023), an analysis missing in prior work. This benchmark consists of a Two-Alternative Forced Choice (2AFC) on trios of images: given a reference and two options, which option has greater embedding cosine similarity with the reference? ... Results in Figure 5 show that procedural data metrics have performance within 1% of the Places model, trained on real data without domain overlap. The ImageNet model has class overlap with NIGHTS, and thus is only for reference. ... In Figure 5 it visually appears that Shaders-based procedural models are all quite close in performance to each other and to the realistic Places model. To quantitatively test this, we performed a z-test and determined that Places, S. KML, and Shaders are all equivalent at the 5% level. This supports the finding that procedural models have reached the level of real models on this benchmark. For the z-test, we used the average NIGHTS results and the number of samples in the val dataset (1720). We also include standard deviations of the mean for reference in Table 4. |
| Hardware Specification | Yes | Training the linear classifier required 4 GPU-days on an 8-V100 node, while computing the embeddings of KNN classification required just 1.5 GPU-hours and is doable on a single GPU. |
| Software Dependencies | No | Appendix A states: 'We trained a vision transformer (Small ViT) (Dosovitskiy et al., 2021) for each dataset (ImageNet, Places, Shaders KML Mixup, Shaders KML, Shaders Mixup, Shaders, and Stylegan), using the recipe and architecture of the original DINO paper (Caron et al., 2021).' This mentions the core frameworks and models used but does not provide specific version numbers for software like Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | In particular, we used the optimal hyperparameters for the model trained on ImageNet for all models, rather than hyper-optimizing for performance on each specific dataset. This results in a much more rigorous evaluation, as the optimal ImageNet hyperparameters are more likely to be bad than good for procedural non-realistic data. These hyperparameters are: learning rate 1e-3, batch size 512, optimizer AdamW, num epochs 100, and DINO head out dim 65536. |
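The KNN classification protocol cited in the Open Datasets row (Table 1) can be sketched as cosine-similarity nearest-neighbor voting over frozen embeddings. This is a minimal sketch, not the paper's implementation: the `k=20` default, the synthetic data, and the plain majority vote (DINO-style KNN evaluation typically uses temperature-weighted votes) are illustrative assumptions.

```python
import numpy as np

def knn_accuracy(train_emb, train_y, test_emb, test_y, k=20):
    """Cosine-similarity KNN classification on frozen embeddings."""
    # L2-normalize rows so the dot product equals cosine similarity
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = te @ tr.T                        # (n_test, n_train) similarity matrix
    nn = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbors
    # majority vote over neighbor labels
    preds = np.array([np.bincount(train_y[idx]).argmax() for idx in nn])
    return (preds == test_y).mean()

# Illustrative use on two well-separated synthetic clusters
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, (50, 8)) + 1.0   # class 0 near the all-ones vector
b = rng.normal(0.0, 0.1, (50, 8)) - 1.0   # class 1 near the all-minus-ones vector
train = np.vstack([a[:40], b[:40]]); ytr = np.array([0] * 40 + [1] * 40)
test = np.vstack([a[40:], b[40:]]); yte = np.array([0] * 10 + [1] * 10)
print(knn_accuracy(train, ytr, test, yte, k=5))
```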
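The segmentation metric quoted in the Open Datasets row, R² (explained variance over total variance) between PCA features and human labels, is consistent with a least-squares fit of one-hot labels from top principal components. The sketch below assumes that reading: the 3-component choice and the regression-with-intercept setup are assumptions for illustration, not the paper's stated procedure.

```python
import numpy as np

def segmentation_r2(features, labels, n_components=3):
    """R^2 between top PCA components of per-pixel features and
    one-hot human segmentation labels, via least-squares regression."""
    X = features - features.mean(axis=0)          # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pcs = X @ Vt[:n_components].T                 # top principal components
    Y = np.eye(labels.max() + 1)[labels]          # one-hot label targets
    A = np.hstack([pcs, np.ones((len(pcs), 1))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    ss_res = (resid ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Illustrative use: features that almost perfectly encode the label
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 200)
features = np.outer(labels, np.ones(5)) + rng.normal(0.0, 0.01, (200, 5))
print(segmentation_r2(features, labels))  # close to 1 for label-aligned features
```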
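The equivalence test described in the Dataset Splits row can be sketched as a two-proportion z-test over the 1720 NIGHTS val trios, treating each 2AFC trio as a Bernoulli trial. The accuracies in the example call are hypothetical placeholders, not the paper's numbers, and the pooled-variance formulation is one standard choice among several.

```python
import math

def two_proportion_z_test(acc_a, acc_b, n):
    """Two-sided two-proportion z-test for 2AFC accuracies,
    each measured on n independent trials."""
    p_pool = (acc_a + acc_b) / 2                   # pooled proportion (equal n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)  # standard error of the difference
    z = (acc_a - acc_b) / se
    # two-sided p-value from the standard normal CDF via erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical accuracies for two models on the 1720 NIGHTS val trios
z, p = two_proportion_z_test(0.90, 0.89, 1720)
print(f"z = {z:.3f}, p = {p:.3f}")  # p > 0.05 => cannot reject equivalence at the 5% level
```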