TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive evaluation of 21 open-source models and 10 proprietary models on TOMATO, revealing a substantial gap between human-level and MFM-enabled visual temporal reasoning capabilities: a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. |
| Researcher Affiliation | Academia | Ziyao Shangguan¹, Chuhan Li¹, Yuxuan Ding¹, Yanan Zheng¹, Yilun Zhao¹, Tesca Fitzgerald¹, Arman Cohan¹ ² — ¹Yale University, ²Allen Institute for AI |
| Pseudocode | No | No, the paper describes the methodology in prose and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/yale-nlp/TOMATO |
| Open Datasets | Yes | TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks... applied to 1,417 videos... sourcing from YouTube, four existing video datasets (Jang et al., 2017; Yi et al., 2020; Li et al., 2022; Patraucean et al., 2023), as well as self-recorded and -generated videos (§4.2). F License Information. ... YouTube videos. All videos sourced from YouTube are licensed under Creative Commons. The original YouTube video links are provided in our dataset, and proper attribution is given in accordance with the license terms. |
| Dataset Splits | No | No, the paper evaluates pre-trained multimodal foundation models on the TOMATO benchmark rather than training new models on it, thus specific dataset splits for training/validation/testing are not applicable or provided in the context of their experiments. |
| Hardware Specification | Yes | We use NVIDIA A100 GPUs for all non-API-based evaluation. |
| Software Dependencies | No | No, the paper mentions using GPT-4o-mini for answer extraction and lists checkpoints for evaluated models, but it does not provide specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA used in their evaluation setup. |
| Experiment Setup | Yes | All MFMs are evaluated using a zero-shot strategy across all benchmarks, including TOMATO, to ensure fair comparison. For metrics requiring multiple frames, we set m = 16, as our study across m = 1, 8, 16, 32 demonstrates that 16 frames provide a sufficient window for effective analysis (§E.3). For all models, we provide generation configuration in Table 8. |
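The setup row above fixes the number of sampled frames at m = 16. The paper does not publish its sampling helper, so the sketch below is a hypothetical illustration of the common uniform-sampling approach (evenly spaced indices, midpoint of each of m equal segments); the function name and fallback behavior for short videos are assumptions, not the authors' code.

```python
def uniform_frame_indices(total_frames: int, m: int = 16) -> list[int]:
    """Return m frame indices spread evenly across a video.

    Hypothetical sketch: picks the midpoint of each of m equal-length
    segments, so the samples cover the whole clip rather than clustering
    at the start. Videos shorter than m frames keep every frame.
    """
    if total_frames <= m:
        return list(range(total_frames))
    step = total_frames / m
    return [int(step * (i + 0.5)) for i in range(m)]
```

For a 300-frame clip this yields 16 strictly increasing indices from near frame 9 to near frame 290, matching the "sufficient window" intuition: the samples span the full duration at roughly equal spacing.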