Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
Authors: Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B Tenenbaum, James DiCarlo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We leveraged synthetic image datasets generated by a 3D graphics engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. After training, we evaluated the learned representations and their alignment with primate ventral stream neural data using various methods, including the Brain-Score open science platform (Schrimpf et al., 2018). |
| Researcher Affiliation | Academia | Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo; Department of Brain and Cognitive Sciences, MIT. Correspondence: yu EMAIL |
| Pseudocode | No | The paper describes the methodology using prose and mathematical formulas for loss functions (Equations 1-13) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and datasets: https://github.com/YudiXie/multitask-vision |
| Open Datasets | Yes | Code and datasets: https://github.com/YudiXie/multitask-vision We generated several synthetic image datasets using ThreeDWorld (TDW) (Gan et al., 2020), a Unity-based 3D graphics engine (Figure 1b). We generated image datasets that contain up to 100 million images from 117 object categories made up of 548 specific object 3D models (on average, about 5 object models per category). |
| Dataset Splits | No | Models are trained until they reach a plateau in test performance on a held-out test set. We extract the internal activations of our models from 4 different layers in ResNet-18 as in Table B.1. Those are activations of our models in response to 2000 held-out test images in the TDW-117 dataset that have full variations in all the latents we investigated. The paper mentions a held-out test set and the total dataset size (1.3 million images) but does not specify explicit training/validation/test splits (e.g., percentages or exact counts for the training and validation sets). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | We implemented the model backbone using the PyTorch library (https://pytorch.org/vision/main/models/resnet.html). The paper mentions PyTorch but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | We used mini-batch stochastic gradient descent with the Adam optimizer (Kingma, 2014) and a learning rate of 0.001 for training the neural networks. Models are trained until they reach a plateau in test performance on a held-out test set. All experiments use a batch size of 64. ResNet-50 models are trained for 1,000,000 batches, except those used for the analysis in Figure C.2, which are trained for 1,500,000 batches. ResNet-18 models are trained for 500,000 batches. |
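The reported setup (Adam, learning rate 0.001, batch size 64, regression onto a few spatial latents) can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the tiny CNN stands in for their ResNet-18/50 backbones, the number of spatial latents and the MSE loss are assumptions, and the images and latent targets are random placeholders rather than TDW data.

```python
# Hedged sketch of the training setup described above (assumptions noted inline).
import torch
import torch.nn as nn

NUM_SPATIAL_LATENTS = 4  # assumed count; the paper trains on "just a few" spatial latents

# Toy CNN standing in for the ResNet-18/50 backbones used in the paper.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, NUM_SPATIAL_LATENTS),
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # lr as reported
loss_fn = nn.MSELoss()  # assumed regression loss for continuous latents

def train_step(images, latents):
    """One mini-batch update (the paper uses batch size 64)."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), latents)
    loss.backward()
    optimizer.step()
    return loss.item()

# Random placeholders standing in for TDW images and their spatial latents.
images = torch.randn(64, 3, 32, 32)
latents = torch.randn(64, NUM_SPATIAL_LATENTS)
loss = train_step(images, latents)
```

In the paper this loop would run for 500,000 batches (ResNet-18) or 1,000,000+ batches (ResNet-50), stopping once held-out test performance plateaus.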