Impossible Videos
Authors: Zechen Bai, Hai Ci, Mike Zheng Shou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations reveal limitations and insights for future directions of video models... Based on this benchmark, we conduct comprehensive evaluations for mainstream video understanding models and generation models... Table 2: Evaluation Results of IPV-TXT Across Dimensions, comparing state-of-the-art video generation models using the IPV-TXT benchmark as text prompts in the T2V setting. Table 3: Evaluation Results for Impossible Video Understanding, comparing state-of-the-art Video LLMs on the IPV-Vid benchmark. |
| Researcher Affiliation | Academia | Zechen Bai, Hai Ci, Mike Zheng Shou — Show Lab, National University of Singapore, Singapore. |
| Pseudocode | No | The paper describes the methodology and evaluation process in prose, without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://showlab.github.io/Impossible-Videos/. The paper states 'We will make the data public to inspire future research,' but does not explicitly mention the release of source code for the methodology or provide a direct link to a code repository. |
| Open Datasets | Yes | To this end, we introduce IPV-BENCH, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-BENCH is underpinned by a comprehensive taxonomy... Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt following and creativity capabilities. In addition, a video benchmark is curated to assess Video LLMs on their ability of understanding impossible videos... We will make the data public to inspire future research. Project page: https://showlab.github.io/Impossible-Videos/. |
| Dataset Splits | Yes | To ensure a balanced evaluation, the dataset maintains a 1:1 ratio of synthetic to real-world videos. This task is framed as a binary classification problem and evaluated using average Accuracy and F1-score. |
| Hardware Specification | No | The paper evaluates existing video generation and understanding models using a newly introduced benchmark. It does not provide specific hardware details used for running these evaluations or for generating the benchmark videos. |
| Software Dependencies | No | The paper evaluates existing video generation and understanding models and uses tools like GPT-4o and CLIP for certain tasks, but it does not specify software dependencies with version numbers for its own experimental setup or methodology. |
| Experiment Setup | Yes | Specifically, we combine the six factors Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality, Imaging Quality, and Dynamic Degree from VBench to form our final metric... The weights we use for each factor are: 2.0, 2.0, 0.2, 0.2, 2.0, 1.0. |
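The reported experiment setup combines six VBench factors with explicit per-factor weights. As a minimal sketch, the combination can be read as a weight-normalized average; the paper lists the factors and weights but not the exact aggregation rule, so the formula below is an assumption, and the dictionary keys are illustrative names rather than identifiers from the paper's code.

```python
# Sketch: combining the six VBench factor scores into one metric,
# assuming a weight-normalized average (the aggregation rule is an
# assumption; the paper only reports the factors and their weights).

WEIGHTS = {
    "subject_consistency": 2.0,
    "background_consistency": 2.0,
    "motion_smoothness": 0.2,
    "aesthetic_quality": 0.2,
    "imaging_quality": 2.0,
    "dynamic_degree": 1.0,
}

def combined_score(factor_scores: dict) -> float:
    """Weight-normalized average of per-factor scores, each in [0, 1]."""
    total = sum(WEIGHTS[name] * factor_scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())

# A video scoring 1.0 on every factor yields an overall score of 1.0.
print(combined_score({name: 1.0 for name in WEIGHTS}))  # 1.0
```

Under this reading, Subject Consistency, Background Consistency, and Imaging Quality (weight 2.0 each) dominate the metric, while Motion Smoothness and Aesthetic Quality (0.2 each) contribute little.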