Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions
Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate Phenaki, we test it on the following tasks: 1) text conditional video generation, 2) text-image conditional video generation, 3) open domain time variable text conditional video generation (i.e., story mode), 4) video quantization and 5) image conditional video generation a.k.a. video prediction. |
| Researcher Affiliation | Collaboration | Ruben Villegas (Google Brain), Mohammad Babaeizadeh (Google Brain), Pieter-Jan Kindermans (Google Brain), Hernan Moraldo (Google Brain), Han Zhang (Google Brain), Mohammad Taghi Saffar (Google Brain), Santiago Castro (University of Michigan), Julius Kunze (University College London), Dumitru Erhan (Google Brain) |
| Pseudocode | No | The paper describes the architecture and processes, but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | No | Taken together, these issues contribute to our decision not to release the underlying models, code, data or interactive demo at this time. |
| Open Datasets | Yes | For image generation, there are datasets with billions of image-text pairs (such as LAION-5B [45] and JFT-4B [67]) while the text-video datasets are substantially smaller, e.g. WebVid [4] with 10M videos... Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45]... To evaluate the video encoding and reconstruction performance of C-ViViT, we use the Moments-in-Time (MiT) [33] dataset... For open domain videos, we test Phenaki on Kinetics-600 [9]... |
| Dataset Splits | Yes | MiT contains 802K training, 33K validation and 67K test videos at 25 FPS. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for training or experiments (e.g., GPU/CPU models, memory specifications). It mentions 'state-of-the-art computational capabilities' generally. |
| Software Dependencies | No | The paper mentions models and frameworks used (e.g., T5-XXL, VQ-GAN, MaskGIT) but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45] (more details in Appendix B.3). The model used in the visualisations in this paper was trained for 1 million steps at a batch size of 512, which took less than 5 days. In this setup 80% of the training data came from the video dataset and each image dataset contributed 10%... we train using classifier-free guidance by dropping the text condition 10% of the time during training [20, 65]... L = L_VQ + 0.1 L_Adv + 0.1 L_IP + 1.0 L_VP + 1.0 L_2. |
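Since the authors did not release code, the two training details quoted in the Experiment Setup row can only be approximated. The sketch below is a minimal, hypothetical illustration (all function and variable names are our own, not from the paper): the weighted sum matching the quoted loss L = L_VQ + 0.1 L_Adv + 0.1 L_IP + 1.0 L_VP + 1.0 L_2, and the 10% text-condition dropout used for classifier-free guidance training.

```python
import random

# Weights taken from the combined objective quoted above:
# L = L_VQ + 0.1*L_Adv + 0.1*L_IP + 1.0*L_VP + 1.0*L_2
LOSS_WEIGHTS = {"vq": 1.0, "adv": 0.1, "ip": 0.1, "vp": 1.0, "l2": 1.0}

def combined_loss(terms: dict) -> float:
    """Weighted sum of per-term loss values, keyed by term name."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())

def maybe_drop_text(text_embedding, drop_prob: float = 0.1, rng=random):
    """Classifier-free guidance training: replace the text condition with
    the null condition (here, None) with probability drop_prob (10% in the
    paper)."""
    return None if rng.random() < drop_prob else text_embedding
```

For example, with every raw term equal to 1.0, `combined_loss` returns 1.0 + 0.1 + 0.1 + 1.0 + 1.0 = 3.2, confirming the relative weighting of the adversarial and image-perceptual terms against the others.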