Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper investigates the impact of scaling auto-encoders for reconstruction and generation by substituting the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). The paper's results show that scaling the auto-encoder bottleneck correlates with improved reconstruction, though its relationship with generative performance is more complex. In contrast, scaling the encoder does not lead to gains, while scaling the decoder enhances reconstruction with minimal effect on generation. These findings indicate that scaling the existing auto-encoder paradigm does not significantly improve generative performance. When paired with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. For videos, ViTok achieves state-of-the-art reconstruction and generation performance on 128p UCF-101. |
| Researcher Affiliation | Collaboration | ¹UT Austin, ²GenAI, Meta, ³Stanford University, ⁴Fundamental AI Research, Meta. Correspondence to: Philippe Hansen-Estruch <EMAIL>. |
| Pseudocode | No | The paper describes methods and architectures verbally and with figures (e.g., Figure 1), but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. Structured steps are described within the main text. |
| Open Source Code | No | The paper references existing codebases like VideoMAEv2 (Wang et al., 2023), Big Vision (Beyer et al., 2022), PyTorch (Paszke et al., 2019), Apollo (Zohar et al., 2024), Unified Masked Diffusion (Hansen-Estruch et al., 2024), and Video Occupancy Models (Tomar et al., 2024), stating that their implementation is based on or inspired by them. However, there is no explicit statement from the authors about releasing the source code for their specific ViTok implementation, nor is a direct link to a code repository provided for the work described in this paper. |
| Open Datasets | Yes | We train on large-scale datasets: Shutterstock (450M images) and ImageNet-1K for images, and Shutterstock videos (30M videos) for video. Evaluation is performed on ImageNet-1K, COCO-2017, UCF-101, and Kinetics-700. ... ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K (Deng et al., 2009) and COCO (Lin et al., 2014) datasets... on 128p UCF-101 (Soomro, 2012) dataset. |
| Dataset Splits | Yes | Evaluation is performed on ImageNet-1K, COCO-2017, UCF-101, and Kinetics-700. ...Performance trends are consistent across datasets, with minor rFID variations due to validation set sizes (50k for ImageNet-1K vs 5k for COCO). ... For our video comparison, our reconstruction metrics are computed on the UCF-101 training set... We train a DiT-L model for 500K steps on the UCF-101 training set... |
| Hardware Specification | Yes | For image models, we train using 8 NVIDIA H100 GPUs, where ViTok S-B/16 requires approximately 6–12 hours for stage 1 and 3–6 hours for stage 2 on 256p and 512p resolutions. In comparison, DiT image models take around 72–96 hours to train for 4 million steps on the same resolutions. For video models, ViTok S-B/4x8 is trained on 16 NVIDIA H100 GPUs, taking about 24 hours for stage 1 and 12 hours for stage 2 on 256p, 16-frame videos, and 12 hours for 128p, 16-frame videos. |
| Software Dependencies | Yes | Our implementation is based on the VideoMAEv2 (Wang et al., 2023) codebase and inspired by the Big Vision codebase (Beyer et al., 2022). Utilizing PyTorch (Paszke et al., 2019), we employ Distributed Data Parallel (DDP) for efficient multi-GPU training, along with activation checkpointing, bfloat16 precision, and torch.compile optimizations. |
| Experiment Setup | Yes | To address instability in VAE frameworks, we use a two-stage training approach. Stage 1 trains with MSE, LPIPS, and KL losses (β = 1×10⁻³, η = 1.0, λ = 0) for stable auto-encoding. Stage 2 incorporates the GAN, freezes the encoder, and fine-tunes the decoder with λ = 1.0. ... Stage 1 runs for 100k steps with batch sizes of 1024 (images) and 256 (videos). Stage 2 fine-tunes for another 100k steps. We use AdamW (β₁ = 0.9, β₂ = 0.95), a peak learning rate of 1×10⁻⁴ × (batch size / 256), weight decay of 1×10⁻⁴, and a cosine decay schedule. For Stage 2, we use a StyleGAN (Karras et al., 2019) discriminator with a learning rate of 2×10⁻⁵ and a 25k-step warmup. Training uses bfloat16 autocasting, with EMA (0.9999) introduced in Stage 2. ... we train a class-conditional DiT-L (Peebles & Xie, 2023) with 400M parameters for 500,000 steps and a batch size of 256, applying classifier-free guidance (CFG) (Ho & Salimans, 2022) on a DDIM sampler (Song et al., 2020) over 250 steps and a CFG scale of 1.5. |
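The Stage 1 objective and optimizer recipe quoted above can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the function names, the diagonal-Gaussian KL form, and the `lpips_fn` placeholder (standing in for a real LPIPS network) are assumptions; only the weights (β = 1×10⁻³, η = 1.0, λ = 0 in Stage 1) and the AdamW settings come from the table.

```python
import torch
import torch.nn.functional as F

# Loss weights from the paper's Stage 1 recipe (lambda, the GAN weight,
# is 0 in Stage 1 and raised to 1.0 in Stage 2).
BETA_KL = 1e-3
ETA_LPIPS = 1.0

def stage1_loss(x, x_hat, mu, logvar, lpips_fn=None):
    """Stage 1 objective: MSE + eta * LPIPS + beta * KL (GAN term disabled).

    `lpips_fn` is a hypothetical perceptual-loss callable; pass None to
    drop the LPIPS term. `mu`/`logvar` assume a diagonal-Gaussian
    VAE posterior, for which the KL against N(0, I) has a closed form.
    """
    mse = F.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    lpips = lpips_fn(x_hat, x) if lpips_fn is not None else torch.tensor(0.0)
    return mse + ETA_LPIPS * lpips + BETA_KL * kl

def make_optimizer(params, batch_size=1024, total_steps=100_000):
    """AdamW + cosine decay as described in the table. The linear scaling
    of the peak LR by batch_size/256 is an interpretation of the quoted
    '1e-4 ... 256' setting."""
    opt = torch.optim.AdamW(
        params,
        lr=1e-4 * batch_size / 256,
        betas=(0.9, 0.95),
        weight_decay=1e-4,
    )
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched
```

With a perfect reconstruction and a posterior collapsed onto N(0, I), every term vanishes and the loss is zero, which is a quick sanity check on the sign conventions.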