TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive evaluation of 21 open-source models and 10 proprietary models on TOMATO, revealing a substantial gap between human-level and MFM-enabled visual temporal reasoning capabilities: a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. |
| Researcher Affiliation | Academia | Ziyao Shangguan¹, Chuhan Li¹, Yuxuan Ding¹, Yanan Zheng¹, Yilun Zhao¹, Tesca Fitzgerald¹, Arman Cohan¹ ² — ¹Yale University, ²Allen Institute for AI |
| Pseudocode | No | No, the paper describes the methodology in prose and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/yale-nlp/TOMATO |
| Open Datasets | Yes | TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks... applied to 1,417 videos... sourcing from YouTube, four existing video datasets (Jang et al., 2017; Yi et al., 2020; Li et al., 2022; Patraucean et al., 2023), as well as self-recorded and -generated videos (§4.2). F License Information. ... YouTube videos. All videos sourced from YouTube are licensed under Creative Commons. The original YouTube video links are provided in our dataset, and proper attribution is given in accordance with the license terms. |
| Dataset Splits | No | No, the paper evaluates pre-trained multimodal foundation models on the TOMATO benchmark rather than training new models on it, thus specific dataset splits for training/validation/testing are not applicable or provided in the context of their experiments. |
| Hardware Specification | Yes | We use NVIDIA A100 GPUs for all non-API-based evaluation. |
| Software Dependencies | No | No, the paper mentions using GPT-4o-mini for answer extraction and lists checkpoints for evaluated models, but it does not provide specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA used in their evaluation setup. |
| Experiment Setup | Yes | All MFMs are evaluated using a zero-shot strategy across all benchmarks, including TOMATO, to ensure fair comparison. For metrics requiring multiple frames, we set m = 16, as our study across m = 1, 8, 16, 32 demonstrates that 16 frames provide a sufficient window for effective analysis (§E.3). For all models, we provide generation configuration in Table 8. |
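The setup row above fixes the number of sampled frames at m = 16. The paper does not publish its sampling helper, so the sketch below is a hypothetical illustration of the common uniform-sampling approach (evenly spaced indices, midpoint of each of m equal segments); the function name and fallback behavior for short videos are assumptions, not the authors' code.

```python
def uniform_frame_indices(total_frames: int, m: int = 16) -> list[int]:
    """Return m frame indices spread evenly across a video.

    Hypothetical sketch: picks the midpoint of each of m equal-length
    segments, so the samples cover the whole clip rather than clustering
    at the start. Videos shorter than m frames keep every frame.
    """
    if total_frames <= m:
        return list(range(total_frames))
    step = total_frames / m
    return [int(step * (i + 0.5)) for i in range(m)]
```

For a 300-frame clip this yields 16 strictly increasing indices from near frame 9 to near frame 290, matching the "sufficient window" intuition: the samples span the full duration at roughly equal spacing.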