Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan

ICLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate Phenaki, we test it on the following tasks: 1) text-conditional video generation, 2) text-image-conditional video generation, 3) open-domain, time-variable text-conditional video generation (i.e., story mode), 4) video quantization, and 5) image-conditional video generation, a.k.a. video prediction.
Researcher Affiliation | Collaboration | Ruben Villegas (Google Brain); Mohammad Babaeizadeh (Google Brain); Pieter-Jan Kindermans (Google Brain); Hernan Moraldo (Google Brain); Han Zhang (Google Brain); Mohammad Taghi Saffar (Google Brain); Santiago Castro (University of Michigan); Julius Kunze (University College London); Dumitru Erhan (Google Brain)
Pseudocode | No | The paper describes the architecture and processes but does not provide formal pseudocode or algorithm blocks.
Open Source Code | No | Taken together, these issues contribute to our decision not to release the underlying models, code, data or interactive demo at this time.
Open Datasets | Yes | For image generation, there are datasets with billions of image-text pairs (such as LAION-5B [45] and JFT-4B [67]), while the text-video datasets are substantially smaller, e.g. WebVid [4] with 10M videos... Unless specified otherwise, we train a 1.8B-parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45]... To evaluate the video encoding and reconstruction performance of C-ViViT, we use the Moments in Time (MiT) [33] dataset... For open-domain videos, we test Phenaki on Kinetics-600 [9]...
Dataset Splits | Yes | MiT contains 802K training, 33K validation and 67K test videos at 25 FPS.
Hardware Specification | No | The paper does not provide specific details about the hardware used for training or experiments (e.g., GPU/CPU models, memory specifications). It mentions 'state-of-the-art computational capabilities' only in general terms.
Software Dependencies | No | The paper names the models and frameworks used (e.g., T5-XXL, VQ-GAN, MaskGIT) but does not provide version numbers for software dependencies (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | Unless specified otherwise, we train a 1.8B-parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45] (more details in Appendix B.3). The model used in the visualisations in this paper was trained for 1 million steps at a batch size of 512, which took less than 5 days. In this setup, 80% of the training data came from the video dataset and each image dataset contributed 10%... we train using classifier-free guidance by dropping the text condition 10% of the time during training [20, 65]... L = L_VQ + 0.1·L_Adv + 0.1·L_IP + 1.0·L_VP + 1.0·L_2.
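Since the paper's code is unreleased, the two mechanisms quoted above can only be sketched. The snippet below is a minimal illustration (function and loss-term names are hypothetical, not from the authors' implementation): the weighted sum of loss terms L = L_VQ + 0.1·L_Adv + 0.1·L_IP + 1.0·L_VP + 1.0·L_2, and the classifier-free guidance trick of replacing the text condition with a null embedding 10% of the time during training.

```python
import random

# Loss weights as stated in the paper's training objective:
# L = L_VQ + 0.1*L_Adv + 0.1*L_IP + 1.0*L_VP + 1.0*L_2
LOSS_WEIGHTS = {"vq": 1.0, "adv": 0.1, "ip": 0.1, "vp": 1.0, "l2": 1.0}

def total_loss(terms):
    """Combine individual loss values (a name -> scalar dict)
    using the paper's stated weights."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())

def maybe_drop_text(text_embedding, null_embedding, drop_prob=0.1):
    """Classifier-free guidance training: with probability `drop_prob`
    (10% in the paper), swap the text condition for a null embedding
    so the model also learns an unconditional distribution."""
    return null_embedding if random.random() < drop_prob else text_embedding
```

At sampling time, the unconditional branch learned via this dropout is what allows interpolating between conditional and unconditional predictions to strengthen text adherence.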