MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion
Authors: Onkar Susladkar, Jishu Sen Gupta, Chirag Sehgal, Sparsh Mittal, Rekha Singhal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTAL RESULTS We now present experimental results. The details of the experimental setup are provided in Appendices C and D. Additional qualitative results are provided in Appendices A, G and F. 5.1 LATENT RECONSTRUCTION RESULTS OF 3D-MBQ-VAE For evaluating our 3D-MBQ-VAE, we selected the COCO-2017 and WebVid validation datasets. Following Zhao et al. (2024), we crop each frame to 256×256 resolution and sample 48 frames per video sequentially. As shown in Table 1, our proposed 3D-MBQ-VAE consistently outperforms SOTA 3D VAEs across all metrics. |
| Researcher Affiliation | Collaboration | Onkar Kishor Susladkar (Northwestern University / Yellow.ai), Jishu Sen Gupta (IIT BHU), Chirag Sehgal (Delhi Technological University), Sparsh Mittal (IIT Roorkee), Rekha Singhal (TCS Research) |
| Pseudocode | No | The paper describes its methods through textual descriptions and architectural diagrams (e.g., Figures 2, 3, 4, and 5, and Supplementary Figure S.8) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the code, datasets, and models in open-source (link). |
| Open Datasets | Yes | We curated two datasets, with each data point consisting of a Video-Mask-Sketch-Text conditioning, for our downstream task of sketch-guided video inpainting. We utilized the YouTube-VOS and DAVIS datasets and captioned all the videos using Video-LLaVA-7B-hf. Then, we performed a CLIP-based matching of videos with corresponding sketches from QuickDraw and Sketchy. For pre-training our 3D-MBQ-VAE, we use the YouTube100M (Hershey et al., 2017) dataset. We evaluate the pre-trained models on the MSR-VTT dataset (Chen et al., 2022) using standard metrics such as FVD and CLIPSIM. |
| Dataset Splits | Yes | For training the diffusion model, we use WebVid-10M (Bain et al., 2022) with text as the condition... In both cases, we had an 80-20 split between the training and test set. |
| Hardware Specification | Yes | Inference time was calculated on a single A100 (Table 3); GPU counts of 4×8 A100 ... 8×8 A100 ... 6×8 A100 are listed in Table 10; and "This training is conducted on 8 nodes, each equipped with 8 NVIDIA A100 GPUs (80 GB memory per GPU)." |
| Software Dependencies | No | The paper mentions software components like the T5-XXL encoder, Video-LLaVA-7B-hf, and the SigLIP image encoder, but it does not specify version numbers for key software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | D.1 HYPERPARAMETERS OF VARIOUS TECHNIQUES To ensure a fair comparison, we utilized the respective hyperparameters recommended in the original papers for each method. Table 10 outlines the specific hyperparameters used for training and inference across all baseline methods and our proposed approach. D.2 IMPLEMENTATION DETAILS OF 3D-MBQ-VAE PRE-TRAINING We train our 3D MB-VAE model on the YouTube100M video dataset... The AdamW optimizer is employed with a base learning rate of 1×10⁻⁴ with cosine learning rate decay. To reduce the risk of numerical overflow, we train the 3D MB-VAE model in float32 precision. |
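The pre-training recipe quoted above specifies AdamW with a base learning rate of 1×10⁻⁴ and cosine decay. For readers checking reproducibility, the decay schedule can be sketched in plain Python; the helper name `cosine_lr`, the `min_lr` floor, and the step counts below are illustrative assumptions, not values taken from the paper:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine learning-rate decay from base_lr down to min_lr.

    At step 0 the rate equals base_lr; it follows half a cosine
    period and reaches min_lr at total_steps.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative schedule over a hypothetical 100-step run:
# step 0  -> 1e-4 (the paper's base learning rate)
# step 50 -> 5e-5 (halfway down the cosine curve)
# step 100 -> 0.0
```

In a PyTorch setup this is typically delegated to `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around `torch.optim.AdamW`; the standalone function above just makes the schedule's shape explicit.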