TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We present a comprehensive evaluation of 21 open-source models and 10 proprietary models on TOMATO, revealing a substantial gap between human-level and MFM-enabled visual temporal reasoning capabilities: a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs.
Researcher Affiliation Academia Ziyao Shangguan¹, Chuhan Li¹, Yuxuan Ding¹, Yanan Zheng¹, Yilun Zhao¹, Tesca Fitzgerald¹, Arman Cohan¹,² — ¹Yale University, ²Allen Institute for AI
Pseudocode No No, the paper describes the methodology in prose and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes https://github.com/yale-nlp/TOMATO
Open Datasets Yes TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks... applied to 1,417 videos... sourced from YouTube, four existing video datasets (Jang et al., 2017; Yi et al., 2020; Li et al., 2022; Pătrăucean et al., 2023), as well as self-recorded and self-generated videos (§4.2). License Information (Appendix F). ... YouTube videos. All videos sourced from YouTube are licensed under Creative Commons. The original YouTube video links are provided in our dataset, and proper attribution is given in accordance with the license terms.
Dataset Splits No No, the paper evaluates pre-trained multimodal foundation models on the TOMATO benchmark rather than training new models on it, so training/validation/test splits are not applicable and are not provided.
Hardware Specification Yes We use NVIDIA A100 GPUs for all non-API-based evaluation.
Software Dependencies No No, the paper mentions using GPT-4o-mini for answer extraction and lists checkpoints for evaluated models, but it does not provide specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA used in their evaluation setup.
Experiment Setup Yes All MFMs are evaluated using a zero-shot strategy across all benchmarks, including TOMATO, to ensure fair comparison. For metrics requiring multiple frames, we set m = 16, as our study across m = 1, 8, 16, 32 demonstrates that 16 frames provide a sufficient window for effective analysis (§E.3). For all models, we provide generation configuration in Table 8.
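The fixed frame budget (m = 16) described above is typically realized by sampling m evenly spaced frames from each video. The paper does not specify its exact sampling code, so the following is a minimal sketch of one common approach (uniform, segment-centered index sampling); the function name and logic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of uniform frame-index sampling for a fixed
# frame budget (m = 16), as commonly used in video MFM evaluation.
# Not taken from the TOMATO repository.
def sample_frame_indices(num_frames: int, m: int = 16) -> list[int]:
    """Pick m frame indices spread evenly across a num_frames-long video."""
    if num_frames <= m:
        # Short video: use every available frame.
        return list(range(num_frames))
    # Split the video into m equal segments and take the center of each,
    # so samples cover the full duration without clustering at the ends.
    step = num_frames / m
    return [int(step * (i + 0.5)) for i in range(m)]

# Example: a 120-frame clip sampled down to 16 indices.
indices = sample_frame_indices(120, m=16)
```

The selected indices would then be passed to a video decoder to extract the corresponding frames before feeding them to the model.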