High-Quality Joint Image and Video Tokenization with Causal VAE

Authors: Dawit Mureja Argaw, Xian Liu, Qinsheng Zhang, Joon Son Chung, Ming-Yu Liu, Fitsum Reda

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Our approach outperforms competitors in video quality and compression rates across various datasets. Experimental analyses also highlight its potential as a robust autoencoder for video generation training. ... We compare our method with several state-of-the-art approaches... across multiple video and image benchmarks... using a comprehensive suite of metrics. Our experimental results demonstrate that the proposed autoencoder consistently outperforms the competing baselines... We also perform extensive ablation studies and experimental analyses to further confirm the benefits of the proposed autoencoder."
Researcher Affiliation | Collaboration | ¹Korea Advanced Institute of Science and Technology (KAIST), ²NVIDIA
Pseudocode | No | The paper describes the architecture and methods in text and uses figures (e.g., Figure 1, Figure 2) to illustrate components and their connections, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | "Code and models can be found here." The word "here" is not a clickable link or URL in the provided text, making the code inaccessible without further information.
Open Datasets | Yes | "We use the WebVid-2M (Bain et al., 2021) dataset for model training. We evaluate our model and competing approaches on the video autoencoding task using two representative datasets... Xiph-2K (Niklaus & Liu, 2020) and DAVIS (Pont-Tuset et al., 2017)... Additionally, we benchmark image autoencoding performance using the ImageNet validation set (Russakovsky et al., 2015)... We use the train-split of commonly used video synthesis benchmarks, SkyTimelapse (Zhang et al., 2020) and UCF-101 (Soomro, 2012), for model training."
Dataset Splits | No | "We use the train-split of commonly used video synthesis benchmarks, SkyTimelapse (Zhang et al., 2020) and UCF-101 (Soomro, 2012), for model training." This indicates the use of a predefined training split, but specific percentages, sample counts for train/validation/test splits, and clear instructions for reproducing the split are not provided in the paper text.
Hardware Specification | Yes | "Training is conducted for 250K iterations with a batch size of 48 on 48 NVIDIA A100 (40GB) GPUs. Our video generation experiments are conducted on 16 NVIDIA A100 (80GB) GPUs, adhering to the training configuration in Zheng et al. (2024)."
Software Dependencies | No | The paper mentions models like RAFT, frameworks like Open-Sora, and optimizers like Adam, but does not specify any software libraries or packages with version numbers (e.g., PyTorch, TensorFlow, or Python versions).
Experiment Setup | Yes | "For each step, we randomly sample T + 1 consecutive frames from a video, where T ∈ {8, 16}, and crop them to a size of 128 × 128. The FILM encoder uses an input pyramid with k = 3. GAN training begins after the initial 100K iterations with L_vae. The flow and KL regularization loss weights are set to α_flow = 1e-3 and α_KL = 1e-6, respectively. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 4.5e-5. Training is conducted for 250K iterations with a batch size of 48 on 48 NVIDIA A100 (40GB) GPUs."
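The quoted setup can be collected into a small configuration sketch. This is an illustration only, reconstructed from the numbers reported above: the function and constant names (`total_loss`, `ALPHA_FLOW`, `GAN_START_ITER`, etc.) are hypothetical, since the authors' code is not released, and the individual loss terms are treated as opaque scalars.

```python
# Hedged sketch of the training configuration quoted above.
# All names are illustrative placeholders, not the authors' implementation.
ALPHA_FLOW = 1e-3        # weight on the flow regularization loss
ALPHA_KL = 1e-6          # weight on the KL regularization loss
LEARNING_RATE = 4.5e-5   # Adam learning rate
BATCH_SIZE = 48
TOTAL_ITERS = 250_000
GAN_START_ITER = 100_000  # GAN loss enabled after the initial 100K iterations

def total_loss(recon, flow, kl, gan, step):
    """Combine the VAE objective with the flow and KL regularizers.

    The adversarial (GAN) term is switched on only once `step` reaches
    GAN_START_ITER, matching the warm-up schedule quoted above.
    All arguments are scalar loss values for this sketch.
    """
    loss = recon + ALPHA_FLOW * flow + ALPHA_KL * kl
    if step >= GAN_START_ITER:
        loss = loss + gan
    return loss
```

During the first 100K iterations only the VAE objective (with its regularizers) is optimized; the adversarial term joins afterwards, which is a common stabilization strategy for VAE-GAN training.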