Modeling dynamic social vision highlights gaps between deep learning and humans

Authors: Kathy Garcia, Emalie McMahon, Colin Conwell, Michael Bonner, Leyla Isik

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Here, we extend a dataset of natural videos depicting complex multi-agent interactions by collecting human-annotated sentence captions for each video, and we benchmark 350+ image, video, and language models on behavioral and neural responses to the videos.
Researcher Affiliation | Academia | 1Department of Cognitive Science, 2Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA EMAIL
Pseudocode | No | The paper describes methods in prose but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All code used in this paper and our sentence captions are publicly available: https://github.com/Isik-lab/SIfMRI_modeling.git.
Open Datasets | Yes | The social action ratings and fMRI responses are publicly available on OSF (https://osf.io/4j29y/) under a Creative Commons Attribution 4.0 International (CC-BY-4.0) license. The videos shown to participants and used here to extract model activations are from the Moments in Time (MiT) dataset: http://moments.csail.mit.edu.
Dataset Splits | Yes | The dataset includes 250 three-second videos of social actions, divided into 200 videos for training and 50 videos for evaluation.
Hardware Specification | Yes | We used an institutional high-performance computing cluster equipped with 31 A100 GPU nodes (with a mix of 40 GB and 80 GB memory).
Software Dependencies | No | The paper mentions software such as DeepJuice and spaCy but does not provide specific version numbers for these or other key dependencies required for replication.
Experiment Setup | Yes | Before fitting the linear mapping, we first Z-scored the model-SRP feature space across samples, independently for each feature, on the 200-video train set defined in the original dataset (McMahon et al., 2023), and then normalized the held-out 50-video data by the mean and standard deviation from the train set. We normalized the behavioral and neural data using the same procedure. We performed linear mapping between the normalized model-SRP feature space and the normalized behavioral or neural response using leave-one-out ridge regression optimized for the GPU, as implemented in DeepJuice (Conwell et al., 2024). Our α-penalty search space was seven values sampled from a logspace of 10e-2 to 10e5. In the training set, we performed 4-fold cross-validation in a full sweep of the model to determine the layer that produced the highest performance on the held-out data.
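The normalization and regression procedure described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it substitutes scikit-learn's `RidgeCV` (whose default efficient leave-one-out CV mirrors the leave-one-out ridge described) for DeepJuice's GPU implementation, and uses random arrays in place of the SRP-reduced model activations and behavioral/neural responses; all shapes and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Stand-in data: 200 train videos, 50 held-out videos, 128 SRP features.
# (Real inputs would be SRP-projected model activations and measured responses.)
X_train = rng.normal(size=(200, 128))
X_test = rng.normal(size=(50, 128))
y_train = rng.normal(size=200)
y_test = rng.normal(size=50)

# Z-score each feature on the train set; apply the train-set mean and
# standard deviation to the held-out data (no peeking at test statistics).
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sd
X_test = (X_test - mu) / sd

# Normalize the (behavioral or neural) response the same way.
y_mu, y_sd = y_train.mean(), y_train.std()
y_train = (y_train - y_mu) / y_sd
y_test = (y_test - y_mu) / y_sd

# Leave-one-out ridge regression over a seven-value log-spaced alpha grid.
alphas = np.logspace(-2, 5, 7)
model = RidgeCV(alphas=alphas)  # default CV is efficient leave-one-out
model.fit(X_train, y_train)

# Evaluate on the held-out videos (Pearson correlation).
r = np.corrcoef(model.predict(X_test), y_test)[0, 1]
```

The layer-selection step (a 4-fold cross-validated sweep over all layers of a model) would wrap this fit in a loop over candidate feature spaces, keeping the layer with the best cross-validated score.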