MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion
Authors: Onkar Susladkar, Jishu Sen Gupta, Chirag Sehgal, Sparsh Mittal, Rekha Singhal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTAL RESULTS We now present experimental results. The details of the experimental setup are provided in Appendices C and D. Additional qualitative results are provided in Appendices A, G and F. 5.1 LATENT RECONSTRUCTION RESULTS OF 3D-MBQ-VAE For evaluating our 3D-MBQ-VAE, we selected the COCO-2017 and WebVid validation datasets. Following Zhao et al. (2024), we crop each frame to 256×256 resolution and sample 48 frames per video sequentially. As shown in Table 1, our proposed 3D-MBQ-VAE consistently outperforms SOTA 3D VAEs across all metrics. |
| Researcher Affiliation | Collaboration | Onkar Kishor Susladkar (Northwestern University / Yellow.ai), Jishu Sen Gupta (IIT BHU), Chirag Sehgal (Delhi Technological University), Sparsh Mittal (IIT Roorkee), Rekha Singhal (TCS Research) |
| Pseudocode | No | The paper describes its methods through textual descriptions and architectural diagrams (e.g., Figures 2, 3, 4, and 5, and Supplementary Figure S.8) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the code, datasets, and models in open-source (link). |
| Open Datasets | Yes | We curated two datasets, with each data point consisting of a Video-Mask-Sketch-Text conditioning, for our downstream task of sketch-guided video inpainting. We utilized the YouTube-VOS and DAVIS datasets and captioned all the videos using Video-LLaVA-7B-hf. Then, we performed a CLIP-based matching of videos with corresponding sketches from QuickDraw and Sketchy. For pre-training our 3D-MBQ-VAE, we use the YouTube100M (Hershey et al., 2017) dataset. We evaluate the pre-trained models on the MSR-VTT dataset (Chen et al., 2022) using standard metrics such as FVD and CLIPSIM. |
| Dataset Splits | Yes | For training the diffusion model, we use WebVid-10M (Bain et al., 2022) with text as the condition... In both cases, we had an 80-20 split between the training and test set. |
| Hardware Specification | Yes | Inference time was calculated on a single A100 (Table 3); GPU counts of 4×8 A100 ... 8×8 A100 ... 6×8 A100 are listed in Table 10; and "This training is conducted on 8 nodes, each equipped with 8 NVIDIA A100 GPUs (80 GB memory per GPU)." |
| Software Dependencies | No | The paper mentions software components like the T5-XXL encoder, Video-LLaVA-7B-hf, and the SigLIP image encoder, but it does not specify version numbers for key software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | D.1 HYPERPARAMETERS OF VARIOUS TECHNIQUES To ensure a fair comparison, we utilized the respective hyperparameters recommended in the original papers for each method. Table 10 outlines the specific hyperparameters used for training and inference across all baseline methods and our proposed approach. D.2 IMPLEMENTATION DETAILS OF 3D-MBQ-VAE PRE-TRAINING We train our 3D MB-VAE model on the YouTube100M video dataset... The AdamW optimizer is employed with a base learning rate of 1×10⁻⁴ with cosine learning rate decay. To reduce the risk of numerical overflow, we train the 3D MB-VAE model in float32 precision. |
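The pre-training recipe quoted above specifies AdamW with a base learning rate of 1×10⁻⁴ and cosine decay. For readers checking reproducibility, the decay schedule can be sketched in plain Python; the helper name `cosine_lr`, the `min_lr` floor, and the step counts below are illustrative assumptions, not values taken from the paper:

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine learning-rate decay from base_lr down to min_lr.

    At step 0 the rate equals base_lr; it follows half a cosine
    period and reaches min_lr at total_steps.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative schedule over a hypothetical 100-step run:
# step 0  -> 1e-4 (the paper's base learning rate)
# step 50 -> 5e-5 (halfway down the cosine curve)
# step 100 -> 0.0
```

In a PyTorch setup this is typically delegated to `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around `torch.optim.AdamW`; the standalone function above just makes the schedule's shape explicit.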