3D-Aware Video Generation
Authors: Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Hao Tang, Gordon Wetzstein, Leonidas Guibas, Luc Van Gool, Radu Timofte
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to demonstrate the effectiveness of our approach in generating 3D-aware videos, focusing on the new visual effects it enables and the quality of generated imagery. Moreover, we conduct extensive ablation studies on the design components of our model. Section 4.1 is titled "Experimental Setup" and discusses datasets and metrics. Tables 1, 2, and 3 show quantitative results. Section 4.3 is titled "Ablation". |
| Researcher Affiliation | Academia | 1ETH Zürich 2Stanford University 3KU Leuven 4University of Würzburg. All listed affiliations are academic institutions. |
| Pseudocode | No | The paper describes the methodology using textual explanations, equations (Eq. 1-7), and architectural diagrams (Figure 2), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | We will release the source code for training and testing our algorithms upon acceptance. |
| Open Datasets | Yes | We evaluate our approach on three publicly available, unstructured video datasets: the FaceForensics (Rössler et al., 2019), the MEAD (Wang et al., 2020a), and the TaiChi (Siarohin et al., 2019) dataset. |
| Dataset Splits | No | The paper mentions data used for evaluation (e.g., "The FVD protocol requires 2048 16-frame videos, while the FID score uses 50K images") and preprocessing steps (e.g., "we use every fourth frame to make the motion more dynamic" for TaiChi), but it does not provide specific training/validation/test splits for the datasets used to train the models. |
| Hardware Specification | Yes | We train our model and StyleNeRF using 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma & Ba, 2015) and architectures like StyleGAN2 (Karras et al., 2020), but does not provide specific version numbers for software libraries such as PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | The 3D content code, motion code, and style vector dimensions are all set to 512. Our motion generator (see Fig. 2) is implemented as an MLP with three fully connected (FC) layers and LeakyReLU activations. We set the motion code and hidden dimension of the motion generator to 512, while the output dimension is 128. Our foreground and background NeRFs are modeled as MLPs (with LeakyReLU activations) with 8 and 4 FC layers that contain 128 and 64 hidden units, respectively. We use 10 frequency bands to map the positional input of the foreground and background NeRFs to Fourier features (Mildenhall et al., 2020). Both the image and video discriminator follow the architecture of StyleGAN2 (Karras et al., 2020) with hidden dimensions of 512, and input channels of 3 and 7, respectively. For both the generator and discriminator, we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0025, β1 = 0, β2 = 0.99, and ϵ = 10⁻⁸. For our objective function (Eq. 7), we set λ1 = 0.5 and λ2 = 0.2. We use 16 samples for the NeRF path regularization (Gu et al., 2022). The standard deviation for pitch sampling is 0.15 for all three datasets. For yaw sampling, the standard deviation is 0.3, 0.3, and 0.8 for FaceForensics, MEAD, and TaiChi, respectively. The field of view of the camera is set to 18 degrees. |
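The architecture details quoted in the Experiment Setup row can be sketched in NumPy. This is an illustrative reconstruction, not the authors' released code: the paper only specifies a 3-layer FC motion generator (512 → 512 → 128) with LeakyReLU activations and a 10-frequency-band positional encoding; the LeakyReLU slope, the weight initialization, and the absence of an activation on the final layer are assumptions here.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # Slope of 0.2 is an assumption; the paper does not state it.
    return np.where(x > 0, x, slope * x)

def init_fc(in_dim, out_dim, rng):
    # Simple scaled-Gaussian initialization (assumption, for illustration only).
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    b = np.zeros(out_dim)
    return w, b

rng = np.random.default_rng(0)
# Three FC layers: 512-d motion code -> two 512-d hidden layers -> 128-d output,
# matching the dimensions quoted from the paper.
dims = [512, 512, 512, 128]
layers = [init_fc(i, o, rng) for i, o in zip(dims[:-1], dims[1:])]

def motion_generator(z_motion):
    h = z_motion
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:  # no activation on the final layer (assumption)
            h = leaky_relu(h)
    return h

def positional_encoding(x, n_freqs=10):
    # NeRF-style Fourier features (Mildenhall et al., 2020) with 10 frequency
    # bands, as used for the positional input of the foreground/background NeRFs.
    freqs = 2.0 ** np.arange(n_freqs)
    angles = x[..., None] * freqs                      # (..., dim, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)              # (..., dim * 2 * n_freqs)

z = rng.standard_normal(512)       # a sampled 512-d motion code
out = motion_generator(z)
print(out.shape)                   # (128,)
print(positional_encoding(np.zeros(3)).shape)  # (60,) for a 3-d point
```

The quoted Adam settings (lr = 0.0025, β1 = 0, β2 = 0.99, ε = 10⁻⁸) would correspond to `torch.optim.Adam(params, lr=0.0025, betas=(0, 0.99), eps=1e-8)` in a PyTorch implementation.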