Direct Motion Models for Assessing Generated Videos

Authors: Kelsey R. Allen, Carl Doersch, Guangyao Zhou, Mohammed Suhail, Danny Driess, Ignacio Rocco, Yulia Rubanova, Thomas Kipf, Mehdi S. M. Sajjadi, Kevin Patrick Murphy, Joao Carreira, Sjoerd van Steenkiste

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental A current limitation of video generative models is that they generate plausible looking frames, but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to not only compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also for evaluating motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: trajan-paper.github.io. Section 4. Human evaluation, Section 5.1. Assessing generated videos individually, Section 5.2. Comparing real and generated video pairs, Section 5.3. Comparing video distributions.
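For context on the distribution-comparison use case quoted above: FVD-style metrics fit a Gaussian to each set of video features and score the two sets by the Fréchet distance between the fitted Gaussians. A minimal numpy sketch of that distance is below; the TRAJAN point-track feature extractor itself is not reproduced here, and the function names are illustrative, not from the paper's code.

```python
import numpy as np

def _sqrtm_psd(a):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two [N, D] feature sets."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) computed through a symmetric PSD product,
    # which keeps everything real-valued.
    sqrt_a = _sqrtm_psd(cov_a)
    covmean = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a)
    return float(
        np.sum((mu_a - mu_b) ** 2)
        + np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(covmean)
    )
```

The distance is zero for identical feature distributions and grows with differences in either the means or the covariances, which is why it can compare a generated set against a ground-truth set of any size.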
Researcher Affiliation Industry Kelsey Allen * 1 Carl Doersch 1 Guangyao Zhou 1 Mohammed Suhail 2 Danny Driess 1 Ignacio Rocco 1 Yulia Rubanova 1 Thomas Kipf 1 Mehdi Sajjadi 1 Kevin Murphy 1 Joao Carreira 1 Sjoerd van Steenkiste * 2 *Equal contribution 1Google DeepMind 2Google Research. Correspondence to: Kelsey Allen <EMAIL>, Sjoerd van Steenkiste <EMAIL>.
Pseudocode No The paper describes architectures and methods using diagrams (e.g., Figure 2 for TRAJAN) and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes An overview of the results and link to the code can be found on the project page: trajan-paper.github.io.
Open Datasets Yes We evaluate TRAJAN both on VideoPhy (Bansal et al., 2024b) and EvalCrafter (Liu et al., 2024c). ... To obtain metrics for evaluating motion in videos, we focus on two aspects: how to extract latent representations from videos, and how to compute ordinal-valued metrics for videos that can directly be understood as ranking individual videos as being better or worse. ... We source generated videos from the solid-solid split of VideoPhy (Bansal et al., 2024b), which contains videos generated by 8 different models, and EvalCrafter (Liu et al., 2024c), using 11 different models. ... We obtain real videos from the UCF101 dataset, which consists of 13,320 videos recorded in the wild that show humans performing different types of actions (Soomro et al., 2012).
Dataset Splits Yes Similar to prior work (Ge et al., 2024), we obtain the ground-truth reference distribution by combining the first 32 frames of videos in the train and test split. ... We make use of the solid-solid split of VideoPhy (Bansal et al., 2024b) available for download at https://huggingface.co/datasets/videophysics/videophy_test_public.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or other computing resources used for the experiments. It mentions training models and batch sizes but no hardware specifications.
Software Dependencies No The paper does not provide specific ancillary software versions (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1). It refers to various models and frameworks by name (e.g., BootsTAPIR, Perceiver, Adam, RAFT) but without specifying the software versions used for their implementation or dependencies.
Experiment Setup Yes We train with Adam (Kingma & Ba, 2015) with a warmup cosine learning rate schedule with 1000 warmup steps and a peak learning rate of 2e-4 for 1M steps with a batch size of 64. ... We train a WALT (Gupta et al., 2024) diffusion model (214M params) for frame-conditional video generation on the Kinetics-600 dataset (Carreira et al., 2018). The model architecture and hyperparameters are consistent with those used in the original paper. Training is conducted for 495,000 iterations with a batch size of 256, using videos at a resolution of 128×128 with 17 temporal frames. ... We train MooG for 600K steps on a mixture of datasets, including Kinetics-700 (Carreira et al., 2018), SSv2 (Goyal et al., 2017), ScanNet (Dai et al., 2017), Ego4D (Grauman et al., 2022), and Walking Tours (Venkataramanan et al., 2024).
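The schedule quoted above (linear warmup to a peak learning rate of 2e-4 over 1000 steps, then cosine decay over 1M total steps) can be sketched as a plain schedule function. This is an assumed reconstruction, since the paper only names the schedule; in particular, the decay-to-zero endpoint and the function name are illustrative.

```python
import math

def warmup_cosine_lr(step, peak_lr=2e-4, warmup_steps=1000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice a schedule like this would be passed to the optimizer per training step (e.g., as a callable keyed on the global step counter).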