High-Quality Joint Image and Video Tokenization with Causal VAE
Authors: Dawit Mureja Argaw, Xian Liu, Qinsheng Zhang, Joon Son Chung, Ming-Yu Liu, Fitsum Reda
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach outperforms competitors in video quality and compression rates across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training. We compare our method with several state-of-the-art approaches... across multiple video and image benchmarks... using a comprehensive suite of metrics. Our experimental results demonstrate that the proposed autoencoder consistently outperforms the competing baselines... We also perform extensive ablation studies and experimental analyses to further confirm the benefits of the proposed autoencoder. |
| Researcher Affiliation | Collaboration | 1 Korea Advanced Institute of Science and Technology (KAIST) 2 NVIDIA |
| Pseudocode | No | The paper describes the architecture and methods in text and uses figures (e.g., Figure 1, Figure 2) to illustrate components and their connections, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Code and models can be found here. The phrase "here" is not a clickable link or URL in the provided text, making the code inaccessible without further information. |
| Open Datasets | Yes | We use the WebVid-2M (Bain et al., 2021) dataset for model training. We evaluate our model and competing approaches on the video autoencoding task using two representative datasets... Xiph-2K (Niklaus & Liu, 2020) and DAVIS (Pont-Tuset et al., 2017)... Additionally, we benchmark image autoencoding performance using the ImageNet validation set (Russakovsky et al., 2015)... We use the train-split of commonly used video synthesis benchmarks, SkyTimelapse (Zhang et al., 2020) and UCF-101 (Soomro, 2012), for model training. |
| Dataset Splits | No | We use the train-split of commonly used video synthesis benchmarks, SkyTimelapse (Zhang et al., 2020) and UCF-101 (Soomro, 2012), for model training. This indicates the use of a predefined training split, but specific percentages, sample counts for train/validation/test splits, or clear instructions for reproducing the split are not provided in the paper text. |
| Hardware Specification | Yes | Training is conducted for 250K iterations with a batch size of 48 on 48 NVIDIA A100 (40GB) GPUs. Our video generation experiments are conducted on 16 NVIDIA A100 (80GB) GPUs adhering to the training configuration in Zheng et al. (2024). |
| Software Dependencies | No | The paper mentions models like RAFT and frameworks like Open-Sora, and optimizers like Adam, but does not specify any software libraries or packages with their version numbers (e.g., PyTorch version, TensorFlow version, Python version). |
| Experiment Setup | Yes | For each step, we randomly sample T + 1 consecutive frames from a video, where T ∈ {8, 16}, and crop them to a size of 128 × 128. The FILM encoder uses an input pyramid with k = 3. GAN training begins after the initial 100K iterations with L_vae. The flow and KL regularization loss weights are set to αflow = 1e-3 and αKL = 1e-6, respectively. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 4.5e-5. Training is conducted for 250K iterations with a batch size of 48 on 48 NVIDIA A100 (40GB) GPUs. |
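The reported data-sampling step (T + 1 consecutive frames, random 128 × 128 crop) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_clip` and the NHWC array layout are assumptions, and the hyperparameter constants simply restate the values quoted in the table above.

```python
import numpy as np

# Hyperparameters as reported in the paper (T may also be 16).
T = 8                      # model sees T + 1 consecutive frames per step
CROP = 128                 # spatial crop size, 128 x 128
ALPHA_FLOW = 1e-3          # flow regularization loss weight
ALPHA_KL = 1e-6            # KL regularization loss weight
LR = 4.5e-5                # Adam learning rate
BATCH_SIZE = 48
TOTAL_ITERS = 250_000
GAN_START_ITER = 100_000   # adversarial loss enabled after 100K VAE-only iters

def sample_clip(video, t=T, crop=CROP):
    """Randomly sample t + 1 consecutive frames and take a random spatial crop.

    `video` is assumed to be a (frames, height, width, channels) array.
    """
    f, h, w, c = video.shape
    start = np.random.randint(0, f - t)          # random temporal start
    clip = video[start : start + t + 1]          # t + 1 consecutive frames
    y = np.random.randint(0, h - crop + 1)       # random crop origin
    x = np.random.randint(0, w - crop + 1)
    return clip[:, y : y + crop, x : x + crop]

# Usage: a dummy 32-frame RGB video at 256 x 256 resolution.
video = np.random.rand(32, 256, 256, 3).astype(np.float32)
clip = sample_clip(video)
print(clip.shape)  # (9, 128, 128, 3)
```

With T = 8 the sampled clip has 9 frames, matching the paper's "T + 1 consecutive frames" phrasing; switching `t=16` yields 17-frame clips under the same crop.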