CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation

Authors: Aram Davtyan, Sepehr Sameni, Björn Ommer, Paolo Favaro

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to validate the effectiveness of CAGE across various scenarios, demonstrating its capability to accurately follow the control and to generate high-quality videos that exhibit coherent scene composition and realistic animation. In this section we conduct a series of experiments on several datasets to highlight different capabilities of CAGE and demonstrate its superiority over the prior work.
Researcher Affiliation | Academia | 1. Computer Vision Group, Institute of Informatics, University of Bern, Switzerland; 2. CompVis @ LMU Munich and MCML, Germany
Pseudocode | No | The paper describes the methodology using text and mathematical formulations (Eqs. 1-6), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Project website: https://araachie.github.io/cage
Open Datasets | Yes | Ablations are conducted on the CLEVRER (Yi et al. 2019) dataset. ... We further test CAGE on the BAIR (Ebert et al. 2017) dataset. ... Finally, we test our model on real egocentric videos from the EPIC-KITCHENS (Damen et al. 2021) dataset.
Dataset Splits | No | For ablations we train CAGE with a smaller batch size of 16 samples... To this end, we have annotated 128 images from the test set of the CLEVRER dataset... Employing the optimal settings identified in the ablation studies, we train CAGE with the full batch size of 64 samples and report the results in Tab. 2. Besides the scene composition metrics studied in the ablations, following Davtyan and Favaro (2024), we reconstruct 15 frames of a video from a single initial one conditioned on the control for the first generated frame. ... Following prior work (Menapace et al. 2021), we trained CAGE on the BAIR dataset (Ebert et al. 2017). ... The experiment was conducted on the test videos from BAIR... The paper refers to "training videos" and a "test set" for CLEVRER and to "test videos" for BAIR, and it gives the total training-set sizes (10k CLEVRER videos, 44k BAIR clips). However, it does not provide train/validation/test split ratios or counts for all splits (e.g., the number of test videos for CLEVRER or BAIR), and it does not define how the test sets were constructed.
Hardware Specification | No | Due to limited computing resources, our experiments were conducted on relatively small datasets... The paper does not specify the hardware used for training or inference, such as GPU models, CPU types, or memory.
Software Dependencies | No | CAGE works in the latent space of a pre-trained VQGAN (Esser, Rombach, and Ommer 2021). All frames x_{n:n+k} as well as x_c are separately encoded to latents. Each latent is a feature map of shape c × h × w. These latents are first reshaped to (h·w) × c. Then a random set of tokens (but the same across time) is dropped from each of the latent codes, leaving m = (1 − r)·h·w tokens per latent. Here r is the masking ratio, typically equal to 0.4 in our experiments. The remaining tokens are then concatenated in the first dimension to form a single sequence of (k + 2)·m tokens. This sequence is then passed to v_t, which we model as a Transformer (Vaswani et al. 2017) consisting of 9 blocks of alternating spatial and temporal attention layers. ... For feature extraction on BAIR we found that it is optimal to use the 2 last layers of the ViT-B/14 variant of DINOv2. The experiment was conducted on the test videos from BAIR, utilizing an Euler solver with 50 steps as our ODE solver to ensure precise temporal evolution. ... The paper mentions models and architectures like VQGAN, DINOv2, Transformer, and RAFT, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
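The token-masking scheme quoted above (reshape each latent to (h·w) × c, drop the same random fraction r of spatial tokens from every latent, concatenate the survivors into one sequence) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the NumPy stand-in for VQGAN latents, and the concrete shapes are assumptions.

```python
import numpy as np

def build_token_sequence(latents, r=0.4, rng=None):
    """latents: list of (c, h, w) arrays — k context frames plus 2 extra codes.

    Each latent is reshaped to (h*w, c); the SAME random subset of
    m = (1 - r) * h * w spatial tokens is kept for every latent, and the
    surviving tokens are concatenated along the first (token) dimension.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    c, h, w = latents[0].shape
    m = int((1 - r) * h * w)                 # tokens kept per latent
    keep = rng.permutation(h * w)[:m]        # same indices across time
    seq = [lat.reshape(c, h * w).T[keep] for lat in latents]  # (m, c) each
    return np.concatenate(seq, axis=0)       # ((k + 2) * m, c)

# Example: k = 3 context frames + 2 conditioning codes, 16x16 latents, c = 4.
tokens = build_token_sequence([np.random.randn(4, 16, 16) for _ in range(5)])
print(tokens.shape)  # (765, 4): m = int(0.6 * 256) = 153 tokens per latent
```

The resulting ((k + 2)·m, c) sequence matches the description of the input passed to the Transformer v_t; sharing the kept indices across time is what makes the masking consistent over frames.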
Experiment Setup | Yes | For ablations we train CAGE with a smaller batch size of 16 samples. However, later we show that the performance of the model scales accordingly with larger batch sizes. ... Employing the optimal settings identified in the ablation studies, we train CAGE with the full batch size of 64 samples and report the results in Tab. 2. ... The experiment was conducted on the test videos from BAIR, utilizing an Euler solver with 50 steps as our ODE solver to ensure precise temporal evolution. ... We test different configurations of the model and demonstrate that the variant with 1 DINOv2 layer, π = 0.1, k = 3, with randomized t_i and scale/position invariance performs the best on CLEVRER (see Table 1).
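The reported sampler (a fixed-step Euler solver with 50 steps) amounts to integrating the learned flow ODE dx/dt = v_t(x) from t = 0 to t = 1. A minimal sketch under that reading, with a generic callable standing in for the trained velocity field v_t:

```python
import numpy as np

def euler_sample(v, x0, steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler,
    as commonly used for sampling in flow-matching models. `v` is a stand-in
    for the learned velocity field: any callable with signature v(x, t)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)  # one Euler step of size dt
    return x

# Toy check with a known field: v(x, t) = x gives x(1) = (1 + 1/steps)**steps * x(0),
# which approaches e * x(0) as the number of steps grows.
x1 = euler_sample(lambda x, t: x, np.ones(3), steps=50)
print(x1)
```

With 50 steps the toy example yields (1 + 1/50)^50 ≈ 2.6916 per coordinate, close to e ≈ 2.7183; more steps tighten the integration at the cost of more network evaluations.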