ElasticTok: Adaptive Tokenization for Image and Video

Authors: Wilson Yan, Volodymyr Mnih, Aleksandra Faust, Matei Zaharia, Pieter Abbeel, Hao Liu

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents. In this section, we introduce our evaluation setup and present the results of pretraining ElasticTok to adaptively represent images and videos, as well as its performance on downstream tasks.
Researcher Affiliation | Collaboration | UC Berkeley, Google DeepMind, Carnegie Mellon. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This paper describes work performed at UC Berkeley and is not associated with Amazon.
Pseudocode | Yes | Exact details of our model's forward pass are described in Algorithm 1. Algorithm 2 provides more details on the exact inference process.
Open Source Code | No | Video examples of using ElasticTok can be found on our website: largeworldmodel.github.io/elastictok. This URL points to a project website for examples, not explicitly for source code. The paper does not contain an unambiguous statement of code release.
Open Datasets | Yes | We train our long video models using v4-512 TPUs from Google Cloud on the COYO-700M image dataset and a custom dataset consisting of 6M videos scraped from the web. Image Data: We use COYO-700M (Byeon et al., 2022) for our text-image data. We filter out images smaller than 256×256. After accounting for stale links, we are left with roughly 350M text-image pairs. (Reference: Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.)
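The resolution filter quoted above can be sketched as a simple predicate; the paper only states the 256×256 threshold, so the exact rule (minimum side length) and the helper name below are our assumptions:

```python
# Hedged sketch of the image-resolution filter described in the quote:
# keep only images whose smaller side is at least 256 pixels.
# MIN_SIDE and keep_image are illustrative names, not from the paper.
MIN_SIDE = 256

def keep_image(width: int, height: int) -> bool:
    """Return True if the image meets the minimum-resolution threshold."""
    return min(width, height) >= MIN_SIDE

# Example: filter a list of (width, height) pairs.
sizes = [(512, 384), (200, 800), (256, 256)]
kept = [s for s in sizes if keep_image(*s)]
print(kept)  # [(512, 384), (256, 256)]
```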
Dataset Splits | No | The paper describes batch splits for training different models (e.g., "Batch Split (Image / Video) 100%/0%"), but does not provide explicit train/validation/test splits for the main COYO-700M or custom video datasets used for pre-training ElasticTok.
Hardware Specification | Yes | We train our long video models using v4-512 TPUs from Google Cloud on the COYO-700M image dataset and a custom dataset consisting of 6M videos scraped from the web. We additionally train an image-only model on ImageNet using v4-256 TPUs.
Software Dependencies | No | The paper mentions using various models and architectures like OpenLLaMA-3B, FSQ, VAE, and ViTs, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used for implementation.
Experiment Setup | Yes | The tables below show training details for each of our models. Our Long Video model is trained on a mix of images and video (Batch Split), and each run (e.g., Long Video (2)) is initialized from the previous run (e.g., Long Video (1)). For the discrete (FSQ) model, each block has 4k tokens (256 blocks = 1M tokens), and the continuous (VAE) model has 2k tokens in each block (256 blocks = 512K tokens). (The table in Appendix B lists Batch Size, Total Iterations, Learning Rate, Optimizer, Weight Decay, and Warmup Iterations with specific values.)
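The per-block budgets quoted above imply the totals in parentheses (4k × 256 ≈ 1M; 2k × 256 ≈ 512K). A minimal arithmetic sketch, assuming "4k" means 4096 tokens and "2k" means 2048 (the helper name is ours, not from the paper):

```python
# Sketch of the token-budget arithmetic from the quoted setup.
# Block/token counts come from the paper; total_tokens is an illustrative helper.
def total_tokens(tokens_per_block: int, num_blocks: int) -> int:
    """Total sequence length when every block uses its full token budget."""
    return tokens_per_block * num_blocks

# Discrete (FSQ) model: 4096 tokens per block, 256 blocks.
fsq_total = total_tokens(4096, 256)  # 1_048_576, i.e. "1M" tokens
# Continuous (VAE) model: 2048 tokens per block, 256 blocks.
vae_total = total_tokens(2048, 256)  # 524_288, i.e. "512K" tokens
print(fsq_total, vae_total)
```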