DDLP: Unsupervised Object-centric Video Prediction with Deep Dynamic Latent Particles

Authors: Tal Daniel, Aviv Tamar

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments: (1) compare DLPv2 to the original DLP in the single-image setting (Appendix I.1.1); (2) benchmark DDLP on video prediction against the state-of-the-art G-SWM and SlotFormer; (3) demonstrate the capabilities of DDLP in answering "what if...?" questions by modifying the initial scenes in the latent space; (4) evaluate our design choices through an ablation study; and (5) showcase an application for efficient unconditional video generation by learning a diffusion process in DDLP's latent space. We evaluate our model on 5 datasets with different dynamics and visual properties. Evaluation metrics: for all datasets, we report the standard visual metrics PSNR, SSIM (Wang et al., 2004) and LPIPS (Zhang et al., 2018) to quantify the quality of the generated sequence compared to the ground-truth (GT) sequence. In addition, for Balls-Interaction we calculate the mean Euclidean error (MED, Lin et al. 2020b) summed over the first 10 prediction steps, as we have GT ball positions. We present our quantitative results in Tables 2 and 3, with additional results and rollouts available in Appendix I.
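For context on the metrics quoted above, PSNR and the mean Euclidean error (MED) are straightforward to compute; the following is an illustrative NumPy sketch, not the authors' evaluation code (LPIPS and SSIM require external libraries and are omitted):

```python
import numpy as np

def psnr(gt, pred, max_val=1.0):
    """Peak signal-to-noise ratio between ground-truth and predicted frames."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def med(gt_pos, pred_pos):
    """Mean (over objects) Euclidean error, summed over prediction steps.
    gt_pos, pred_pos: arrays of shape (steps, num_objects, 2)."""
    return float(np.linalg.norm(gt_pos - pred_pos, axis=-1).mean(axis=-1).sum())

# A prediction off by 0.01 everywhere (on a [0, 1] scale) gives PSNR = 40 dB.
gt = np.full((64, 64, 3), 0.5)
pred = gt + 0.01
print(round(psnr(gt, pred), 2))  # 40.0
```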
Researcher Affiliation | Academia | Tal Daniel, EMAIL, Electrical and Computer Engineering, Technion - Israel Institute of Technology; Aviv Tamar, EMAIL, Electrical and Computer Engineering, Technion - Israel Institute of Technology
Pseudocode | Yes | Figure 18: PyTorch-style implementation of the transparency and depth factorization in the stitching process. Figure 21: Efficient PyTorch implementation of normalized cross-correlation with group-convolution.
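As a rough illustration of the quantity Figure 21 computes, normalized cross-correlation (NCC) is a zero-mean, unit-norm dot product between two patches. This is a minimal NumPy sketch for a single pair of patches; the paper's version batches the computation over many particle glimpses at once via PyTorch grouped convolution, which this sketch does not attempt:

```python
import numpy as np

def normalized_cross_correlation(a, b, eps=1e-8):
    """NCC of two equally-shaped patches: values lie in [-1, 1],
    with 1 for identical patterns and -1 for inverted ones."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps
    return float((a * b).sum() / denom)

patch = np.array([[1.0, 2.0], [3.0, 4.0]])
print(round(normalized_cross_correlation(patch, patch), 4))  # 1.0
```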
Open Source Code | Yes | Videos, code and pre-trained models are available: https://taldatech.github.io/ddlp-web/. Code and pre-trained models are available at: https://github.com/taldatech/ddlp.
Open Datasets | Yes | We evaluate our model on 5 datasets with different dynamics and visual properties. Balls-Interaction: (Jiang et al., 2019) A 2D dataset of 3 random-colored balls bouncing and colliding. OBJ3D: (Lin et al., 2020b) A 3D dataset containing CLEVR-like objects (Johnson et al., 2017). CLEVRER: (Yi et al., 2019) A 3D dataset containing CLEVR-like objects. PHYRE: (Bakhtin et al., 2019) A 2D dataset of physical puzzles. Traffic: (Daniel & Tamar, 2022a) Real-world videos of a varying number (up to 20) of cars of different sizes and shapes driving along opposing lanes, captured by a traffic camera.
Dataset Splits | Yes | Balls-Interaction: (Jiang et al., 2019) A 2D dataset of 3 random-colored balls bouncing and colliding with each other. We use 10,000 episodes for training, 200 for validation and 200 for test. OBJ3D: (Lin et al., 2020b) The dataset contains 2,920 episodes for training, 200 for validation and 200 for test. CLEVRER: (Yi et al., 2019) We use the first 5,000 videos for training, 1,000 for validation and 1,000 for test. PHYRE: (Bakhtin et al., 2019) We use 2,574 episodes for training, 312 for validation and 400 for test. Traffic: (Daniel & Tamar, 2022a) The dataset contains 133,000 frames, where we take 80% for training, 10% for validation and 10% for test.
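The fractional 80/10/10 split quoted for the Traffic dataset can be sketched as contiguous index ranges; this is illustrative only, and the exact boundaries used in the released code may differ:

```python
def split_indices(n_items, train_frac=0.8, val_frac=0.1):
    """Contiguous train/val/test split by fraction; the remainder goes to test."""
    n_train = int(n_items * train_frac)
    n_val = int(n_items * val_frac)
    return (range(0, n_train),
            range(n_train, n_train + n_val),
            range(n_train + n_val, n_items))

# 133,000 Traffic frames -> 106,400 train / 13,300 val / 13,300 test.
train, val, test = split_indices(133_000)
print(len(train), len(val), len(test))  # 106400 13300 13300
```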
Hardware Specification | Yes | Our experiments with DDLP and G-SWM were conducted primarily on machines equipped with 4 NVIDIA RTX 2080 11GB GPUs or 1 NVIDIA A4000 16GB GPU. We trained SlotFormer on 4 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions: "Our method is implemented in PyTorch (Paszke et al., 2017)", "We base our implementation on the minGPT open-source implementation (Karpathy, 2021)", "We use the Adam (Kingma & Ba, 2014) optimizer", and "Denoising Diffusion Probabilistic Model (DDPM, Ho et al. (2020)) and the publicly available code base (Wang, 2022)". Specific version numbers for PyTorch, minGPT, or the DDPM codebase are not provided; only citations to the original works or general references are given.
Experiment Setup | Yes | DDLP is trained end-to-end, effectively regularizing the posterior particles to be predictable by the learned prior particles, with the Adam (Kingma & Ba, 2014) optimizer and an initial learning rate of 2e-4, which is gradually decreased with a step scheduler. Hyper-parameters: all CNNs are initialized from a small Gaussian distribution N(0, 0.01²) and use replication-padding. We use the Adam (Kingma & Ba, 2014) optimizer (β1 = 0.9, β2 = 0.999, ϵ = 1e-4) with an initial learning rate of 2e-4 and a step scheduler that multiplies the learning rate by 0.95 at the end of each epoch. The constant prior distribution parameters are the same for all datasets and are reported in Table 5. The complete set of remaining hyper-parameters can be found in Table 4. Table 4: Detailed hyperparameters used for the various experiments in the paper, including Input Frames (Train/Inference), Posterior KP K, Prior KP Proposals L, Reconstruction Loss, βKL, Prior Patch Size, Glimpse Size S, Feature Dim d, and Epochs for each dataset.
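The multiplicative step schedule described above (initial learning rate 2e-4, multiplied by 0.95 at the end of each epoch) has a simple closed form; the helper below is a hypothetical sketch for illustration, not the authors' training code:

```python
def lr_at_epoch(epoch, base_lr=2e-4, gamma=0.95):
    """Learning rate after `epoch` full epochs under a multiply-by-gamma
    step schedule (illustrative sketch, not the paper's code)."""
    return base_lr * gamma ** epoch

print(lr_at_epoch(0))   # 0.0002
print(lr_at_epoch(10))  # ~1.2e-4 after 10 epochs
```

In PyTorch, the same decay can be expressed as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)`, stepped once per epoch.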