Modeling dynamic social vision highlights gaps between deep learning and humans

Authors: Kathy Garcia, Emalie McMahon, Colin Conwell, Michael Bonner, Leyla Isik

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Here, we extend a dataset of natural videos depicting complex multi-agent interactions by collecting human-annotated sentence captions for each video, and we benchmark 350+ image, video, and language models on behavioral and neural responses to the videos.
Researcher Affiliation | Academia | 1Department of Cognitive Science, 2Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA EMAIL
Pseudocode | No | The paper describes methods in prose but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All code used in this paper and our sentence captions are publicly available: https://github.com/Isik-lab/SIfMRI_modeling.git.
Open Datasets | Yes | The social action ratings and fMRI responses are publicly available on OSF (https://osf.io/4j29y/) under a Creative Commons Attribution 4.0 International (CC-BY-4.0) license. The videos shown to participants and used here to extract model activations are from the Moments in Time (MiT) dataset: http://moments.csail.mit.edu.
Dataset Splits | Yes | The dataset includes 250 three-second videos of social actions, divided into 200 videos for training and 50 videos for evaluation.
Hardware Specification | Yes | We used an institutional high-performance computing cluster equipped with 31 A100 GPU nodes (with a mix of 40 GB and 80 GB memory).
Software Dependencies | No | The paper mentions software such as DeepJuice and spaCy but does not provide specific version numbers for these or other key dependencies required for replication.
Experiment Setup | Yes | Before fitting the linear mapping, we first Z-scored the model-SRP feature space across samples, independently for each feature, on the 200-video train set defined in the original dataset (McMahon et al., 2023), and then normalized the held-out 50-video data by the mean and standard deviation from the train set. We normalized the behavioral and neural data using the same procedure. We performed linear mapping between the normalized model-SRP feature space and the normalized behavioral or neural response using leave-one-out ridge regression optimized for the GPU, as implemented in DeepJuice (Conwell et al., 2024). Our α-penalty search space was seven values sampled from a logspace of 10e-2 to 10e5. In the training set, we performed 4-fold cross-validation in a full sweep of the model to determine the layer that produced the highest performance on the held-out data.
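The normalization and regression procedure described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it substitutes scikit-learn's `RidgeCV` (whose default efficient leave-one-out CV mirrors the leave-one-out ridge described) for DeepJuice's GPU implementation, and uses random arrays in place of the SRP-reduced model activations and behavioral/neural responses; all shapes and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Stand-in data: 200 train videos, 50 held-out videos, 128 SRP features.
# (Real inputs would be SRP-projected model activations and measured responses.)
X_train = rng.normal(size=(200, 128))
X_test = rng.normal(size=(50, 128))
y_train = rng.normal(size=200)
y_test = rng.normal(size=50)

# Z-score each feature on the train set; apply the train-set mean and
# standard deviation to the held-out data (no peeking at test statistics).
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sd
X_test = (X_test - mu) / sd

# Normalize the (behavioral or neural) response the same way.
y_mu, y_sd = y_train.mean(), y_train.std()
y_train = (y_train - y_mu) / y_sd
y_test = (y_test - y_mu) / y_sd

# Leave-one-out ridge regression over a seven-value log-spaced alpha grid.
alphas = np.logspace(-2, 5, 7)
model = RidgeCV(alphas=alphas)  # default CV is efficient leave-one-out
model.fit(X_train, y_train)

# Evaluate on the held-out videos (Pearson correlation).
r = np.corrcoef(model.predict(X_test), y_test)[0, 1]
```

The layer-selection step (a 4-fold cross-validated sweep over all layers of a model) would wrap this fit in a loop over candidate feature spaces, keeping the layer with the best cross-validated score.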