Separating Knowledge and Perception with Procedural Data
Authors: Adrian Rodriguez-Munoz, Manel Baradad, Phillip Isola, Antonio Torralba
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to a model trained on Places, our procedural model performs within 1% on NIGHTS visual similarity, outperforms by 8% and 15% on CUB200 and Flowers102 fine-grained classification, and is within 10% on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an R² on COCO within 10% of models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA. Correspondence to: Adrian Rodríguez-Muñoz <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes like 'Shaders KML Mixup process' (Section 3.1) and illustrates them with diagrams (Figure 7), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | No | The paper mentions 'On the original research code' in section 8, implying the existence of their own code. However, it does not provide any explicit statement about releasing this code, nor does it include any links to a code repository. The references include code from other projects (e.g., 'Gildenblat, J. and contributors. Pytorch library for cam methods.'), but this is not the code for the methodology described in this paper. |
| Open Datasets | Yes | We evaluate the procedural networks on a Human Visual Similarity (HVS) task using the NIGHTS dataset (Fu et al., 2023)... Table 1 shows KNN classification accuracy on various fine-grained datasets and ImageNet-1K (Russakovsky et al., 2015)... Segmentation: Procedural models have remarkable semantic segmentation ability. Figure 9 qualitatively shows how procedural features clearly separate the bike and zebras from their surroundings. Quantitatively, Table 2 shows numerical R² (ratio of explained variance to total variance) between principal component analysis (PCA) features and human labels. The best procedural model is within 10% of real data models and well above random and RGB features. Procedural models are also capable of in-context segmentation: given a prompt image and a prompt mask representing a concept, they can effectively search for it in a new query image, even in the presence of equally colored distractors, as in the second row of Figure 10. However, they struggle at KNN semantic segmentation with a large visual memory, as seen in Figure 11. As explained in Section 3, the DINO objective on real data teaches models to have similar representations for parts of real-world objects, even when the parts are visually dissimilar. In contrast, procedural models, having never seen the object during training, will have dissimilar representations for the parts. We can observe this in Figure 9: the center and spokes of the wheel are colored the same in real models and differently in procedural models. Procedural models have excessively local representations, which are vulnerable to spurious similarities with object parts of other classes. |
| Dataset Splits | Yes | We evaluate the procedural networks on a Human Visual Similarity (HVS) task using the NIGHTS dataset (Fu et al., 2023), an analysis missing in prior work. This benchmark consists of a Two-Alternative Forced Choice (2AFC) on trios of images: given a reference and two options, which option has greater embedding cosine similarity with the reference? ... Results in Figure 5 show that procedural data metrics have performance within 1% of the Places model, trained on real data without domain overlap. The ImageNet model has class overlap with NIGHTS, and thus is only for reference. ... In Figure 5 it visually appears that Shaders-based procedural models are all quite close in performance to each other and to the realistic Places model. To quantitatively test this, we performed a z-test and determined that Places, S. KML, and Shaders are all equivalent at the 5% level. This supports the finding that procedural models have reached the level of real models on this benchmark. For the z-test, we used the average NIGHTS results and the number of samples in the val dataset (1720). We also include standard deviations of the mean for reference in Table 4. |
| Hardware Specification | Yes | Training the linear classifier required 4 GPU-days on an 8-V100 node, while computing the embeddings of KNN classification required just 1.5 GPU-hours and is doable on a single GPU. |
| Software Dependencies | No | Appendix A states: 'We trained a vision transformer (Small ViT) (Dosovitskiy et al., 2021) for each dataset (ImageNet, Places, Shaders KML Mixup, Shaders KML, Shaders Mixup, Shaders, and Stylegan), using the recipe and architecture of the original DINO paper (Caron et al., 2021).' This mentions the core frameworks and models used but does not provide specific version numbers for software like Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | In particular, we used the optimal hyperparameters for the model trained on ImageNet for all models, rather than hyper-optimizing for performance on each specific dataset. This results in a much more rigorous evaluation, as the optimal ImageNet hyperparameters are more likely to be bad than good for procedural non-realistic data. These hyperparameters are: learning rate 1e-3, batch size 512, optimizer AdamW, num epochs 100, and DINO head out dim 65536. |
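The KNN classification protocol cited in the Open Datasets row (Table 1) can be sketched as cosine-similarity nearest-neighbor voting over frozen embeddings. This is a minimal sketch, not the paper's implementation: the `k=20` default, the synthetic data, and the plain majority vote (DINO-style KNN evaluation typically uses temperature-weighted votes) are illustrative assumptions.

```python
import numpy as np

def knn_accuracy(train_emb, train_y, test_emb, test_y, k=20):
    """Cosine-similarity KNN classification on frozen embeddings."""
    # L2-normalize rows so the dot product equals cosine similarity
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = te @ tr.T                        # (n_test, n_train) similarity matrix
    nn = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbors
    # majority vote over neighbor labels
    preds = np.array([np.bincount(train_y[idx]).argmax() for idx in nn])
    return (preds == test_y).mean()

# Illustrative use on two well-separated synthetic clusters
rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, (50, 8)) + 1.0   # class 0 near the all-ones vector
b = rng.normal(0.0, 0.1, (50, 8)) - 1.0   # class 1 near the all-minus-ones vector
train = np.vstack([a[:40], b[:40]]); ytr = np.array([0] * 40 + [1] * 40)
test = np.vstack([a[40:], b[40:]]); yte = np.array([0] * 10 + [1] * 10)
print(knn_accuracy(train, ytr, test, yte, k=5))
```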
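The segmentation metric quoted in the Open Datasets row, R² (explained variance over total variance) between PCA features and human labels, is consistent with a least-squares fit of one-hot labels from top principal components. The sketch below assumes that reading: the 3-component choice and the regression-with-intercept setup are assumptions for illustration, not the paper's stated procedure.

```python
import numpy as np

def segmentation_r2(features, labels, n_components=3):
    """R^2 between top PCA components of per-pixel features and
    one-hot human segmentation labels, via least-squares regression."""
    X = features - features.mean(axis=0)          # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pcs = X @ Vt[:n_components].T                 # top principal components
    Y = np.eye(labels.max() + 1)[labels]          # one-hot label targets
    A = np.hstack([pcs, np.ones((len(pcs), 1))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    ss_res = (resid ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Illustrative use: features that almost perfectly encode the label
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 200)
features = np.outer(labels, np.ones(5)) + rng.normal(0.0, 0.01, (200, 5))
print(segmentation_r2(features, labels))  # close to 1 for label-aligned features
```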
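The equivalence test described in the Dataset Splits row can be sketched as a two-proportion z-test over the 1720 NIGHTS val trios, treating each 2AFC trio as a Bernoulli trial. The accuracies in the example call are hypothetical placeholders, not the paper's numbers, and the pooled-variance formulation is one standard choice among several.

```python
import math

def two_proportion_z_test(acc_a, acc_b, n):
    """Two-sided two-proportion z-test for 2AFC accuracies,
    each measured on n independent trials."""
    p_pool = (acc_a + acc_b) / 2                   # pooled proportion (equal n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)  # standard error of the difference
    z = (acc_a - acc_b) / se
    # two-sided p-value from the standard normal CDF via erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical accuracies for two models on the 1720 NIGHTS val trios
z, p = two_proportion_z_test(0.90, 0.89, 1720)
print(f"z = {z:.3f}, p = {p:.3f}")  # p > 0.05 => cannot reject equivalence at the 5% level
```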