Direct Motion Models for Assessing Generated Videos

Authors: Kelsey R. Allen, Carl Doersch, Guangyao Zhou, Mohammed Suhail, Danny Driess, Ignacio Rocco, Yulia Rubanova, Thomas Kipf, Mehdi S. M. Sajjadi, Kevin Patrick Murphy, Joao Carreira, Sjoerd van Steenkiste

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental A current limitation of video generative models is that they generate plausible looking frames, but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to not only compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also for evaluating motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: trajan-paper.github.io. Section 4. Human evaluation, Section 5.1. Assessing generated videos individually, Section 5.2. Comparing real and generated video pairs, Section 5.3. Comparing video distributions.
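For context on the distribution-comparison use case quoted above: FVD-style metrics fit a Gaussian to each set of video features and score the two sets by the Fréchet distance between the fitted Gaussians. A minimal numpy sketch of that distance is below; the TRAJAN point-track feature extractor itself is not reproduced here, and the function names are illustrative, not from the paper's code.

```python
import numpy as np

def _sqrtm_psd(a):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two [N, D] feature sets."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) computed through a symmetric PSD product,
    # which keeps everything real-valued.
    sqrt_a = _sqrtm_psd(cov_a)
    covmean = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a)
    return float(
        np.sum((mu_a - mu_b) ** 2)
        + np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(covmean)
    )
```

The distance is zero for identical feature distributions and grows with differences in either the means or the covariances, which is why it can compare a generated set against a ground-truth set of any size.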
Researcher Affiliation Industry Kelsey Allen * 1 Carl Doersch 1 Guangyao Zhou 1 Mohammed Suhail 2 Danny Driess 1 Ignacio Rocco 1 Yulia Rubanova 1 Thomas Kipf 1 Mehdi Sajjadi 1 Kevin Murphy 1 Joao Carreira 1 Sjoerd van Steenkiste * 2 *Equal contribution 1Google DeepMind 2Google Research. Correspondence to: Kelsey Allen <EMAIL>, Sjoerd van Steenkiste <EMAIL>.
Pseudocode No The paper describes architectures and methods using diagrams (e.g., Figure 2 for TRAJAN) and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes An overview of the results and link to the code can be found on the project page: trajan-paper.github.io.
Open Datasets Yes We evaluate TRAJAN both on VideoPhy (Bansal et al., 2024b) and EvalCrafter (Liu et al., 2024c). ... To obtain metrics for evaluating motion in videos, we focus on two aspects: how to extract latent representations from videos, and how to compute ordinal-valued metrics for videos that can directly be understood as ranking individual videos as being better or worse. ... We source generated videos from the solid-solid split of VideoPhy (Bansal et al., 2024b), which contains videos generated by 8 different models, and EvalCrafter (Liu et al., 2024c), using 11 different models. ... We obtain real videos from the UCF101 dataset, which consists of 13,320 videos recorded in the wild that show humans performing different types of actions (Soomro et al., 2012).
Dataset Splits Yes Similar to prior work (Ge et al., 2024), we obtain the ground-truth reference distribution by combining the first 32 frames of videos in the train and test split. ... We make use of the solid-solid split of VideoPhy (Bansal et al., 2024b) available for download at https://huggingface.co/datasets/videophysics/videophy_test_public.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or other computing resources used for the experiments. It mentions training models and batch sizes but no hardware specifications.
Software Dependencies No The paper does not provide specific ancillary software versions (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1). It refers to various models and frameworks by name (e.g., BootsTAPIR, Perceiver, Adam, RAFT) but without specifying the software versions used for their implementation or dependencies.
Experiment Setup Yes We train with Adam (Kingma & Ba, 2015) with a warmup cosine learning rate schedule with 1000 warmup steps and a peak learning rate of 2e-4 for 1M steps with a batch size of 64. ... We train a WALT (Gupta et al., 2024) diffusion model (214M params) for frame-conditional video generation on the Kinetics-600 dataset (Carreira et al., 2018). The model architecture and hyperparameters are consistent with those used in the original paper. Training is conducted for 495,000 iterations with a batch size of 256, using videos at a resolution of 128×128 with 17 temporal frames. ... We train MooG for 600K steps on a mixture of datasets, including Kinetics-700 (Carreira et al., 2018), SSv2 (Goyal et al., 2017), ScanNet (Dai et al., 2017), Ego4D (Grauman et al., 2022), and Walking Tours (Venkataramanan et al., 2024).
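The schedule quoted above (linear warmup to a peak learning rate of 2e-4 over 1000 steps, then cosine decay over 1M total steps) can be sketched as a plain schedule function. This is an assumed reconstruction, since the paper only names the schedule; in particular, the decay-to-zero endpoint and the function name are illustrative.

```python
import math

def warmup_cosine_lr(step, peak_lr=2e-4, warmup_steps=1000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice a schedule like this would be passed to the optimizer per training step (e.g., as a callable keyed on the global step counter).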