One Hundred Neural Networks and Brains Watching Videos: Lessons from Alignment
Authors: Christina Sartzetaki, Gemma Roig, Cees G Snoek, Iris Groen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work takes a step towards answering this question by performing the first large-scale benchmarking of deep video models on representational alignment to the human brain, using publicly available models and a recently released video brain imaging (fMRI) dataset. We disentangle four factors of variation in the models (temporal modeling, classification task, architecture, and training dataset) that affect alignment to the brain, which we measure by conducting Representational Similarity Analysis across multiple brain regions and model layers. We show that temporal modeling is key for alignment to brain regions involved in early visual processing, while a relevant classification task is key for alignment to higher-level regions. Moreover, we identify clear differences between the brain scoring patterns across layers of CNNs and Transformers, and reveal how training dataset biases transfer to alignment with functionally selective brain areas. Additionally, we uncover a negative correlation of computational complexity to brain alignment. Measuring a total of 99 neural networks and 10 human brains watching videos, we aim to forge a path that widens our understanding of temporal and semantic video representations in brains and machines, ideally leading towards more efficient video models and more mechanistic explanations of processing in the human brain. |
| Researcher Affiliation | Academia | 1Informatics Institute, University of Amsterdam, The Netherlands 2Department of Computer Science, Goethe University Frankfurt, Germany |
| Pseudocode | No | The paper describes methods and procedures in narrative text and equations (e.g., in Section 3 Methodology and 3.1 Alignment by Representational Similarity Analysis), but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Open-source code to reproduce this benchmarking study can be found in the github repository https://github.com/SergeantChris/hundred_models_brains, along with detailed installation and usage instructions. |
| Open Datasets | Yes | Most recently the Bold Moments Dataset (BMD) (Lahner et al., 2024) was introduced, putting forth a large-scale, highly reliable video fMRI dataset with extensive quality control to ensure suitability for AI model comparisons. In this work, we conduct the first extensive benchmarking of a total of 99 models on representational alignment with the fMRI data in BMD. ... We use the Bold Moments Dataset (BMD) (Lahner et al., 2024) consisting of whole-brain 3T fMRI recordings (2.5 × 2.5 × 2.5 mm voxels, resampled TR of 1s) from 10 subjects watching 1102 3s videos from the Moments in Time (Monfort et al., 2019) and Multi-Moments in Time (Monfort et al., 2021) video datasets. |
| Dataset Splits | Yes | The 1102 videos selected for the main experiment were split into a training and a testing set; 102 videos were randomly chosen for the testing set. ... For each subject, 1000 videos were shown for 3 repetitions and those recordings make up the training set, whereas 102 videos were shown for 10 repetitions and make up the test set. In our analysis, we only use the 102 videos of the test set, whose high number of repetitions allows for the application of RSA. |
| Hardware Specification | No | The MRI data were acquired with a 3T Trio Siemens scanner using a 32-channel head coil. During the experimental runs, functional T2*-weighted gradient-echo echo-planar images (EPI) were collected... The paper specifies hardware for MRI data acquisition but does not specify any hardware (e.g., GPU, CPU models) used for running the neural network models or computational experiments. |
| Software Dependencies | No | We utilize the RSA implementation from the Net2Brain python library (Bersch et al., 2022). ... Image models trained on object recognition were ported from torchvision and timm, while image and video models trained on action recognition were ported from mmaction2. ... MRI data was converted to BIDS format and preprocessed using the standardized fMRIPrep preprocessing pipeline (Esteban et al., 2019). ... GLMsingle (Prince et al., 2022). The paper mentions various software tools and libraries, some with citations (which imply a specific version or release of the tool at the time of publication of the cited work), but does not provide specific version numbers for the software components directly used in the benchmarking experiments (e.g., torchvision, timm, mmaction2, Net2Brain). |
| Experiment Setup | Yes | To create f_l we first reduce the dimensionality of the original features to 100 Principal Components using Principal Component Analysis (PCA). ... For each model we permute the rows of all layer RDMs 1000 times using the same 1000 random permutations for all models and layers. |
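The experiment-setup row describes the core RSA pipeline: model features are reduced to 100 principal components, turned into a Representational Dissimilarity Matrix (RDM), compared to a brain RDM, and assessed for significance via a permutation test. A minimal sketch of that pipeline is shown below, assuming Pearson-correlation-distance RDMs and Spearman correlation between upper-triangular RDM entries; the function names (`rdm`, `rsa_score`) and the exact permutation scheme are illustrative, not the paper's implementation (which uses the Net2Brain library).

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(features):
    """RDM as 1 - Pearson correlation between every pair of stimulus feature vectors."""
    return 1.0 - np.corrcoef(features)

def rsa_score(model_features, brain_rdm, n_components=100, n_perm=1000, seed=0):
    """Sketch of an RSA comparison: PCA to n_components, model RDM,
    Spearman correlation to the brain RDM, and a label-permutation null."""
    # PCA via SVD of the centered feature matrix (stimuli x features)
    X = model_features - model_features.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    X_pc = X @ Vt[:k].T

    model_rdm = rdm(X_pc)
    iu = np.triu_indices_from(model_rdm, k=1)  # compare only unique pairs
    rho = spearmanr(model_rdm[iu], brain_rdm[iu])[0]

    # Null distribution: shuffle stimulus labels (rows and columns) of the model RDM
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(model_rdm.shape[0])
        null[i] = spearmanr(model_rdm[p][:, p][iu], brain_rdm[iu])[0]
    p_value = (np.sum(null >= rho) + 1) / (n_perm + 1)
    return rho, p_value
```

In the paper's setup this would be run per subject, per brain region, and per model layer, with the same 1000 permutations reused across models and layers so that scores remain comparable.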