Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper investigates the impact of scaling auto-encoders for reconstruction and generation by substituting the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). The paper's results show that scaling the auto-encoder bottleneck correlates with improved reconstruction, though its relationship with generative performance is more complex. In contrast, scaling the encoder does not lead to gains, while scaling the decoder enhances reconstruction with minimal effect on generation. These findings indicate that scaling the existing auto-encoder paradigm does not significantly improve generative performance. When paired with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. For videos, ViTok achieves state-of-the-art reconstruction and generation performance on 128p UCF-101. |
| Researcher Affiliation | Collaboration | ¹UT Austin, ²GenAI, Meta, ³Stanford University, ⁴Fundamental AI Research, Meta. Correspondence to: Philippe Hansen-Estruch <EMAIL>. |
| Pseudocode | No | The paper describes methods and architectures verbally and with figures (e.g., Figure 1), but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. Structured steps are described within the main text. |
| Open Source Code | No | The paper references existing codebases like VideoMAEv2 (Wang et al., 2023), Big Vision (Beyer et al., 2022), PyTorch (Paszke et al., 2019), Apollo (Zohar et al., 2024), Unified Masked Diffusion (Hansen-Estruch et al., 2024), and Video Occupancy Models (Tomar et al., 2024), stating that their implementation is based on or inspired by them. However, there is no explicit statement from the authors about releasing the source code for their specific ViTok implementation, nor is a direct link to a code repository provided for the work described in this paper. |
| Open Datasets | Yes | We train on large-scale datasets: Shutterstock (450M images) and ImageNet-1K for images, and Shutterstock videos (30M videos) for video. Evaluation is performed on ImageNet-1K, COCO-2017, UCF-101, and Kinetics-700. ... ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K (Deng et al., 2009) and COCO (Lin et al., 2014) datasets... on 128p UCF-101 (Soomro, 2012) dataset. |
| Dataset Splits | Yes | Evaluation is performed on ImageNet-1K, COCO-2017, UCF-101, and Kinetics-700. ...Performance trends are consistent across datasets, with minor rFID variations due to validation set sizes (50k for ImageNet-1K vs 5k for COCO). ... For our video comparison, our reconstruction metrics are computed on the UCF-101 training set... We train a DiT-L model for 500K steps on the UCF-101 training set... |
| Hardware Specification | Yes | For image models, we train using 8 NVIDIA H100 GPUs, where ViTok S-B/16 requires approximately 6–12 hours for stage 1 and 3–6 hours for stage 2 on 256p and 512p resolutions. In comparison, DiT image models take around 72–96 hours to train for 4 million steps on the same resolutions. For video models, ViTok S-B/4x8 is trained on 16 NVIDIA H100 GPUs, taking about 24 hours for stage 1 and 12 hours for stage 2 on 256p, 16-frame videos, and 12 hours for 128p, 16-frame videos. |
| Software Dependencies | Yes | Our implementation is based on the VideoMAEv2 (Wang et al., 2023) codebase and inspired by the Big Vision codebase (Beyer et al., 2022). Utilizing PyTorch (Paszke et al., 2019), we employ Distributed Data Parallel (DDP) for efficient multi-GPU training, along with activation checkpointing, bfloat16 precision, and torch.compile optimizations. |
| Experiment Setup | Yes | To address instability in VAE frameworks, we use a two-stage training approach. Stage 1 trains with MSE, LPIPS, and KL losses (β = 1×10⁻³, η = 1.0, λ = 0) for stable auto-encoding. Stage 2 incorporates the GAN, freezes the encoder, and fine-tunes the decoder with λ = 1.0. ... Stage 1 runs for 100k steps with batch sizes of 1024 (images) and 256 (videos). Stage 2 fine-tunes for another 100k steps. We use AdamW (β₁ = 0.9, β₂ = 0.95), a peak learning rate of 1×10⁻⁴ × (batch size / 256), weight decay of 1×10⁻⁴, and a cosine decay schedule. For Stage 2, we use a StyleGAN (Karras et al., 2019) discriminator with a learning rate of 2×10⁻⁵ and a 25k-step warmup. Training uses bfloat16 autocasting, with EMA (0.9999) introduced in Stage 2. ... we train a class-conditional DiT-L (Peebles & Xie, 2023) with 400M parameters for 500,000 steps and a batch size of 256, applying classifier-free guidance (CFG) (Ho & Salimans, 2022) on a DDIM sampler (Song et al., 2020) over 250 steps and a CFG scale of 1.5. |
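The Stage 1 objective and optimizer recipe quoted above can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the function names, the diagonal-Gaussian KL form, and the `lpips_fn` placeholder (standing in for a real LPIPS network) are assumptions; only the weights (β = 1×10⁻³, η = 1.0, λ = 0 in Stage 1) and the AdamW settings come from the table.

```python
import torch
import torch.nn.functional as F

# Loss weights from the paper's Stage 1 recipe (lambda, the GAN weight,
# is 0 in Stage 1 and raised to 1.0 in Stage 2).
BETA_KL = 1e-3
ETA_LPIPS = 1.0

def stage1_loss(x, x_hat, mu, logvar, lpips_fn=None):
    """Stage 1 objective: MSE + eta * LPIPS + beta * KL (GAN term disabled).

    `lpips_fn` is a hypothetical perceptual-loss callable; pass None to
    drop the LPIPS term. `mu`/`logvar` assume a diagonal-Gaussian
    VAE posterior, for which the KL against N(0, I) has a closed form.
    """
    mse = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    lpips = lpips_fn(x_hat, x) if lpips_fn is not None else torch.tensor(0.0)
    return mse + ETA_LPIPS * lpips + BETA_KL * kl

def make_optimizer(params, batch_size=1024, total_steps=100_000):
    """AdamW + cosine decay as described in the table. The linear scaling
    of the peak LR by batch_size/256 is an interpretation of the quoted
    '1e-4 ... 256' setting."""
    opt = torch.optim.AdamW(
        params,
        lr=1e-4 * batch_size / 256,
        betas=(0.9, 0.95),
        weight_decay=1e-4,
    )
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched
```

With a perfect reconstruction and a posterior collapsed onto N(0, I), every term vanishes and the loss is zero, which is a quick sanity check on the sign conventions.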