High-Quality Joint Image and Video Tokenization with Causal VAE

Authors: Dawit Mureja Argaw, Xian Liu, Qinsheng Zhang, Joon Son Chung, Ming-Yu Liu, Fitsum Reda

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Our approach outperforms competitors in video quality and compression rates across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training. ... We compare our method with several state-of-the-art approaches... across multiple video and image benchmarks... using a comprehensive suite of metrics. Our experimental results demonstrate that the proposed autoencoder consistently outperforms the competing baselines... We also perform extensive ablation studies and experimental analyses to further confirm the benefits of the proposed autoencoder."
Researcher Affiliation | Collaboration | ¹Korea Advanced Institute of Science and Technology (KAIST), ²NVIDIA
Pseudocode | No | The paper describes the architecture and methods in text and uses figures (e.g., Figure 1, Figure 2) to illustrate components and their connections, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | "Code and models can be found here." The word "here" is not a clickable link or URL in the provided text, making the code inaccessible without further information.
Open Datasets | Yes | "We use the WebVid-2M (Bain et al., 2021) dataset for model training. We evaluate our model and competing approaches on the video autoencoding task using two representative datasets... Xiph-2K (Niklaus & Liu, 2020) and DAVIS (Pont-Tuset et al., 2017)... Additionally, we benchmark image autoencoding performance using the ImageNet validation set (Russakovsky et al., 2015)... We use the train-split of commonly used video synthesis benchmarks, SkyTimelapse (Zhang et al., 2020) and UCF-101 (Soomro, 2012), for model training."
Dataset Splits | No | "We use the train-split of commonly used video synthesis benchmarks, SkyTimelapse (Zhang et al., 2020) and UCF-101 (Soomro, 2012), for model training." This indicates the use of a predefined training split, but specific percentages, sample counts for train/validation/test splits, and clear instructions for reproducing the split are not provided in the paper text.
Hardware Specification | Yes | "Training is conducted for 250K iterations with a batch size of 48 on 48 NVIDIA A100 (40GB) GPUs. Our video generation experiments are conducted on 16 NVIDIA A100 (80GB) GPUs, adhering to the training configuration in Zheng et al. (2024)."
Software Dependencies | No | The paper mentions models like RAFT, frameworks like Open-Sora, and optimizers like Adam, but does not specify any software libraries or packages with version numbers (e.g., PyTorch, TensorFlow, or Python versions).
Experiment Setup | Yes | "For each step, we randomly sample T + 1 consecutive frames from a video, where T ∈ {8, 16}, and crop them to a size of 128 × 128. The FILM encoder uses an input pyramid with k = 3. GAN training begins after the initial 100K iterations with L_vae. The flow and KL regularization loss weights are set to α_flow = 1e-3 and α_KL = 1e-6, respectively. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 4.5e-5. Training is conducted for 250K iterations with a batch size of 48 on 48 NVIDIA A100 (40GB) GPUs."
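The quoted setup can be collected into a small configuration sketch. This is an illustration only, reconstructed from the numbers reported above: the function and constant names (`total_loss`, `ALPHA_FLOW`, `GAN_START_ITER`, etc.) are hypothetical, since the authors' code is not released, and the individual loss terms are treated as opaque scalars.

```python
# Hedged sketch of the training configuration quoted above.
# All names are illustrative placeholders, not the authors' implementation.
ALPHA_FLOW = 1e-3        # weight on the flow regularization loss
ALPHA_KL = 1e-6          # weight on the KL regularization loss
LEARNING_RATE = 4.5e-5   # Adam learning rate
BATCH_SIZE = 48
TOTAL_ITERS = 250_000
GAN_START_ITER = 100_000  # GAN loss enabled after the initial 100K iterations

def total_loss(recon, flow, kl, gan, step):
    """Combine the VAE objective with the flow and KL regularizers.

    The adversarial (GAN) term is switched on only once `step` reaches
    GAN_START_ITER, matching the warm-up schedule quoted above.
    All arguments are scalar loss values for this sketch.
    """
    loss = recon + ALPHA_FLOW * flow + ALPHA_KL * kl
    if step >= GAN_START_ITER:
        loss = loss + gan
    return loss
```

During the first 100K iterations only the VAE objective (with its regularizers) is optimized; the adversarial term joins afterwards, which is a common stabilization strategy for VAE-GAN training.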