Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions
Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate Phenaki, we test it on the following tasks: 1) text conditional video generation, 2) text-image conditional video generation, 3) open domain time variable text conditional video generation (i.e., story mode), 4) video quantization and 5) image conditional video generation a.k.a. video prediction. |
| Researcher Affiliation | Collaboration | Ruben Villegas (Google Brain), Mohammad Babaeizadeh (Google Brain), Pieter-Jan Kindermans (Google Brain), Hernan Moraldo (Google Brain), Han Zhang (Google Brain), Mohammad Taghi Saffar (Google Brain), Santiago Castro (University of Michigan), Julius Kunze (University College London), Dumitru Erhan (Google Brain) |
| Pseudocode | No | The paper describes the architecture and processes, but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | No | Taken together, these issues contribute to our decision not to release the underlying models, code, data or interactive demo at this time. |
| Open Datasets | Yes | For image generation, there are datasets with billions of image-text pairs (such as LAION-5B [45] and JFT-4B [67]) while the text-video datasets are substantially smaller, e.g. WebVid [4] with 10M videos... Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45]... To evaluate the video encoding and reconstruction performance of C-ViViT, we use the Moments-in-Time (MiT) [33] dataset... For open domain videos, we test Phenaki on Kinetics-600 [9]... |
| Dataset Splits | Yes | MiT contains 802K training, 33K validation and 67K test videos at 25 FPS. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for training or experiments (e.g., GPU/CPU models, memory specifications). It mentions 'state-of-the-art computational capabilities' generally. |
| Software Dependencies | No | The paper mentions models and frameworks used (e.g., T5-XXL, VQ-GAN, MaskGIT) but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45] (more details in Appendix B.3). The model used in the visualisations in this paper was trained for 1 million steps at a batch size of 512, which took less than 5 days. In this setup 80% of the training data came from the video dataset and each image dataset contributed 10%... we train using classifier-free guidance by dropping the text condition 10% of the time during training [20, 65]... L = L_VQ + 0.1 L_Adv + 0.1 L_IP + 1.0 L_VP + 1.0 L_2. |
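Since the authors did not release code, the two training details quoted in the Experiment Setup row can only be approximated. The sketch below is a minimal, hypothetical illustration (all function and variable names are our own, not from the paper): the weighted sum matching the quoted loss L = L_VQ + 0.1 L_Adv + 0.1 L_IP + 1.0 L_VP + 1.0 L_2, and the 10% text-condition dropout used for classifier-free guidance training.

```python
import random

# Weights taken from the combined objective quoted above:
# L = L_VQ + 0.1*L_Adv + 0.1*L_IP + 1.0*L_VP + 1.0*L_2
LOSS_WEIGHTS = {"vq": 1.0, "adv": 0.1, "ip": 0.1, "vp": 1.0, "l2": 1.0}

def combined_loss(terms: dict) -> float:
    """Weighted sum of per-term loss values, keyed by term name."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())

def maybe_drop_text(text_embedding, drop_prob: float = 0.1, rng=random):
    """Classifier-free guidance training: replace the text condition with
    the null condition (here, None) with probability drop_prob (10% in the
    paper)."""
    return None if rng.random() < drop_prob else text_embedding
```

For example, with every raw term equal to 1.0, `combined_loss` returns 1.0 + 0.1 + 0.1 + 1.0 + 1.0 = 3.2, confirming the relative weighting of the adversarial and image-perceptual terms against the others.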