3D-Aware Video Generation
Authors: Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Hao Tang, Gordon Wetzstein, Leonidas Guibas, Luc Van Gool, Radu Timofte
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to demonstrate the effectiveness of our approach in generating 3D-aware videos, focusing on the new visual effects it enables and the quality of generated imagery. Moreover, we conduct extensive ablation studies on the design components of our model. Section 4.1 is titled "Experimental Setup" and discusses datasets and metrics. Tables 1, 2, and 3 show quantitative results. Section 4.3 is titled "Ablation". |
| Researcher Affiliation | Academia | 1ETH Zürich 2Stanford University 3KU Leuven 4University of Würzburg. All listed affiliations are academic institutions. |
| Pseudocode | No | The paper describes the methodology using textual explanations, equations (Eq. 1-7), and architectural diagrams (Figure 2), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | We will release the source code for training and testing our algorithms upon acceptance. |
| Open Datasets | Yes | We evaluate our approach on three publicly available, unstructured video datasets: the FaceForensics (Rössler et al., 2019), the MEAD (Wang et al., 2020a), and the TaiChi (Siarohin et al., 2019) dataset. |
| Dataset Splits | No | The paper mentions data used for evaluation (e.g., "The FVD protocol requires 2048 16-frame videos, while the FID score uses 50K images") and preprocessing steps (e.g., "we use every fourth frame to make the motion more dynamic" for TaiChi), but it does not provide specific training/validation/test splits for the datasets used to train the models. |
| Hardware Specification | Yes | We train our model and StyleNeRF using 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2015) and architectures like StyleGAN2 (Karras et al., 2020), but does not provide specific version numbers for software libraries such as PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | The 3D content code, motion code, and style vector dimensions are all set to 512. Our motion generator (see Fig. 2) is implemented as an MLP with three fully connected (FC) layers and LeakyReLU activations. We set the motion code and hidden dimension of the motion generator to 512, while the output dimension is 128. Our foreground and background NeRFs are modeled as MLPs (with LeakyReLU activations) with 8 and 4 FC layers that contain 128 and 64 hidden units, respectively. We use 10 frequency bands to map the positional input of the foreground and background NeRFs to Fourier features (Mildenhall et al., 2020). Both the image and video discriminator follow the architecture of StyleGAN2 (Karras et al., 2020) with hidden dimensions of 512, and input channels of 3 and 7, respectively. For both the generator and discriminator, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0025, β1 = 0, β2 = 0.99, and ϵ = 10⁻⁸. For our objective function (Eq. 7), we set λ1 = 0.5 and λ2 = 0.2. We use 16 samples for the NeRF path regularization (Gu et al., 2022). The standard deviation for pitch sampling is 0.15 for all three datasets. For yaw sampling, the standard deviation is 0.3, 0.3, and 0.8 for FaceForensics, MEAD, and TaiChi, respectively. The field of view of the camera is set to 18 degrees. |
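The architecture details quoted in the Experiment Setup row can be sketched in NumPy. This is an illustrative reconstruction, not the authors' released code: the paper only specifies a 3-layer FC motion generator (512 → 512 → 128) with LeakyReLU activations and a 10-frequency-band positional encoding; the LeakyReLU slope, the weight initialization, and the absence of an activation on the final layer are assumptions here.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # Slope of 0.2 is an assumption; the paper does not state it.
    return np.where(x > 0, x, slope * x)

def init_fc(in_dim, out_dim, rng):
    # Simple scaled-Gaussian initialization (assumption, for illustration only).
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    b = np.zeros(out_dim)
    return w, b

rng = np.random.default_rng(0)
# Three FC layers: 512-d motion code -> two 512-d hidden layers -> 128-d output,
# matching the dimensions quoted from the paper.
dims = [512, 512, 512, 128]
layers = [init_fc(i, o, rng) for i, o in zip(dims[:-1], dims[1:])]

def motion_generator(z_motion):
    h = z_motion
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:  # no activation on the final layer (assumption)
            h = leaky_relu(h)
    return h

def positional_encoding(x, n_freqs=10):
    # NeRF-style Fourier features (Mildenhall et al., 2020) with 10 frequency
    # bands, as used for the positional input of the foreground/background NeRFs.
    freqs = 2.0 ** np.arange(n_freqs)
    angles = x[..., None] * freqs                      # (..., dim, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)              # (..., dim * 2 * n_freqs)

z = rng.standard_normal(512)       # a sampled 512-d motion code
out = motion_generator(z)
print(out.shape)                   # (128,)
print(positional_encoding(np.zeros(3)).shape)  # (60,) for a 3-d point
```

The quoted Adam settings (lr = 0.0025, β1 = 0, β2 = 0.99, ε = 10⁻⁸) would correspond to `torch.optim.Adam(params, lr=0.0025, betas=(0, 0.99), eps=1e-8)` in a PyTorch implementation.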