Sparse components distinguish visual pathways & their alignment to neural networks

Authors: Ammar I Marvi, Nancy Kanwisher, Meenakshi Khosla

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental To answer the first question, we applied a data-driven approach to identify the dominant components of the fMRI response to natural images in the three visual pathways of human high-level cortex, and in intermediate layers of DNNs. We quantified representational alignment between DNNs and each of the dorsal, ventral, and lateral streams using linear encoding, RSA, SCA, and CMS. Through simulations, we first demonstrate the sensitivity of our proposed SCA framework to subtle changes in the axes of representations.
Researcher Affiliation Academia Ammar I Marvi MIT EMAIL Nancy G Kanwisher MIT EMAIL Meenakshi Khosla UCSD EMAIL
Pseudocode Yes Algorithm 1 Sparse Component Alignment (SCA)
Open Source Code Yes Code for the analyses performed in this study will be made available at https://github.com/aimarvi/NSDstreams.
Open Datasets Yes We applied data-driven Bayesian non-negative matrix factorization (NMF) to identify the dominant components of visual representations in each stream in four subjects of the Natural Scenes Dataset (NSD), a massive naturalistic fMRI dataset (Allen et al., 2022). We leveraged this capability to analyze the Meadows dataset, a behavioral dataset from the NSD.
Dataset Splits Yes Linear encoding falls into the set of methods belonging to category A and involves linearly combining responses from model units to predict voxel responses. This approach is well-established in the neuroscience literature, as it aims to optimally align model and brain response spaces through linear transformations while minimizing the introduction of complex non-linearities. These transformations are often preferred due to the assumption that downstream readout mechanisms apply approximately-linear functions on their inputs (Cao & Yamins, 2024). We extracted feature activations from the ultimate (AlexNet) or penultimate (ResNet-50) pooling layer of convolutional models, or from the best performing attention head in vision transformers (ViT), and used the activations to predict neural responses to a shared set of 1,000 images viewed by each subject (via a ridge regression with an 80/20 train/test split).
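The linear-encoding analysis quoted above can be sketched as follows. This is a minimal illustration, not the paper's code: the feature activations and voxel responses below are synthetic stand-ins (random matrices with a hypothetical linear mapping), and the ridge penalty `alpha=1.0` is an arbitrary choice; only the overall recipe (linear map from model units to voxels, fit with ridge regression on an 80/20 split) follows the description.

```python
# Sketch of a linear encoding model: ridge regression from DNN-layer
# activations to voxel responses, evaluated on a held-out 20% of images.
# All data here are synthetic stand-ins for the real activations/fMRI betas.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_features, n_voxels = 1000, 512, 50

X = rng.standard_normal((n_images, n_features))          # model activations
W = rng.standard_normal((n_features, n_voxels)) * 0.1    # hypothetical mapping
Y = X @ W + 0.5 * rng.standard_normal((n_images, n_voxels))  # voxel responses

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, Y_tr)     # one linear map per voxel
r2 = model.score(X_te, Y_te)                 # encoding accuracy, held-out images
print(f"held-out R^2: {r2:.3f}")
```

In practice the regularization strength would be selected by cross-validation within the training split, and accuracy is typically reported per voxel rather than as the aggregate score used here.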
Hardware Specification No The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup Yes The number of components in each iteration is a free parameter, which we fix at C = 20 for consistency across subjects, streams, models, and layers. This was principally motivated by a previous study that used Bayesian information criteria to estimate the optimal number of components in modeling the ventral visual stream (Khosla et al., 2022). However, we note that similar results also arise when deriving between 10 to 30 components. We then identified the most consistent components across subjects using a shared set of 1,000 images viewed by each subject.
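The component derivation described above can be illustrated with a short sketch. The paper uses Bayesian NMF; here scikit-learn's standard `NMF` stands in for it, and the voxel-by-image response matrix is synthetic, so this shows only the shape of the analysis (a non-negative factorization into C = 20 components), not the authors' actual procedure.

```python
# Illustrative NMF decomposition of a (voxels x images) response matrix
# into C = 20 components. scikit-learn's NMF is a non-Bayesian stand-in
# for the Bayesian NMF used in the paper; data are synthetic.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_voxels, n_images, C = 2000, 1000, 20

# Non-negative response matrix (e.g., rectified response amplitudes)
R = rng.random((n_voxels, n_images))

nmf = NMF(n_components=C, init="nndsvda", max_iter=300, random_state=0)
V = nmf.fit_transform(R)   # voxel loadings on each component (n_voxels x C)
S = nmf.components_        # component response profiles    (C x n_images)

print(V.shape, S.shape)
```

Cross-subject consistency could then be assessed by correlating each component's response profile `S` (over the shared 1,000 images) across subjects and keeping the components that match reliably.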