Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations

Authors: Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We leveraged synthetic image datasets generated by a 3D graphics engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. After training, we evaluated the learned representations and their alignment with primate ventral stream neural data using various methods, including the Brain-Score open science platform (Schrimpf et al., 2018).
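The training objective described above (estimating category and spatial latents jointly) can be sketched as a combined loss. This is a minimal illustrative sketch, not the authors' code: the function names, the cross-entropy-plus-MSE decomposition, and the `w_spatial` weighting are assumptions; the paper defines its actual loss functions in Equations 1-13.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (category head)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def mse(pred, target):
    """Mean squared error over spatial latents (e.g., position, distance, pose)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multitask_loss(logits, label, spatial_pred, spatial_target, w_spatial=1.0):
    """Hypothetical combined objective: classification + spatial regression.

    w_spatial is an assumed task-weighting knob, not a parameter from the paper.
    """
    return cross_entropy(logits, label) + w_spatial * mse(spatial_pred, spatial_target)
```

Models trained on only the spatial term, only the category term, or both correspond to the different training conditions the paper compares.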
Researcher Affiliation: Academia. Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo; Department of Brain and Cognitive Sciences, MIT. Correspondence: yu EMAIL
Pseudocode: No. The paper describes the methodology using prose and mathematical formulas for the loss functions (Equations 1-13) but does not include any structured pseudocode or algorithm blocks.
Open Source Code: Yes. Code and datasets: https://github.com/YudiXie/multitask-vision
Open Datasets: Yes. Code and datasets: https://github.com/YudiXie/multitask-vision. We generated several synthetic image datasets using ThreeDWorld (TDW) (Gan et al., 2020), a Unity-based 3D graphics engine (Figure 1b). We generated image datasets that contain up to 100 million images from 117 object categories made up of 548 specific 3D object models (on average, about 5 object models per category).
Dataset Splits: No. Models are trained until they reach a plateau in test performance on a held-out test set. We extract the internal activations of our models from 4 different layers in ResNet-18 as in Table B.1. Those are activations of our models in response to 2,000 held-out test images in the TDW-117 dataset that have full variations in all the latents we investigated. The paper mentions a held-out test set and the total size of the dataset (1.3 million images) but does not specify explicit training/validation/test splits (e.g., percentages or exact counts for the training and validation sets).
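To make the missing detail concrete: the paper reports 2,000 held-out test images out of roughly 1.3 million, so a reproducer would have to choose the split procedure themselves. A minimal sketch under that assumption (the seeded shuffle and function name are hypothetical, not from the paper):

```python
import random

def split_holdout(n_images, n_test=2000, seed=0):
    """Hold out n_test image indices for testing; return (train, test).

    Illustrative only: the paper states the held-out test set size (2,000)
    but not how the split was drawn, so the shuffle and seed are assumptions.
    """
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]

train_idx, test_idx = split_holdout(1_300_000)
```

With no reported validation split, a reproduction would also need to decide how (or whether) to carve a validation set out of the training portion.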
Hardware Specification: No. The paper does not provide specific hardware details, such as GPU or CPU models or memory specifications, used for running the experiments.
Software Dependencies: No. We implemented the model backbone using the PyTorch library (https://pytorch.org/vision/main/models/resnet.html). The paper mentions PyTorch but does not specify a version number or other software dependencies with their versions.
Experiment Setup: Yes. We used mini-batch stochastic gradient descent with the Adam optimizer (Kingma, 2014) with a learning rate of 0.001 for training the neural networks. Models are trained until they reach a plateau in test performance on a held-out test set. All experiments are trained with a batch size of 64. ResNet-50 models are trained with 1,000,000 batches, except those used for analysis in Figure C.2, which are trained with 1,500,000 batches. ResNet-18 models are trained with 500,000 batches.
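The reported hyperparameters can be collected into one place, together with a single-parameter Adam update showing what the cited optimizer computes. The `CONFIG` keys are hypothetical names for the paper's stated values; the Adam step uses the standard defaults from Kingma (2014), which the paper does not explicitly confirm.

```python
# Hyperparameters as reported in the paper (key names are illustrative).
CONFIG = {
    "optimizer": "Adam",
    "lr": 1e-3,
    "batch_size": 64,
    "batches_resnet18": 500_000,
    "batches_resnet50": 1_000_000,   # 1,500,000 for the Figure C.2 models
}

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (standard formulation).

    b1, b2, and eps are the usual defaults; the paper only states the
    learning rate, so the remaining values are assumptions.
    """
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (v_hat ** 0.5 + eps), m, v
```

In PyTorch this corresponds to `torch.optim.Adam(model.parameters(), lr=1e-3)`; the pure-Python version above just makes the per-step arithmetic explicit.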