ElasticTok: Adaptive Tokenization for Image and Video
Authors: Wilson Yan, Volodymyr Mnih, Aleksandra Faust, Matei Zaharia, Pieter Abbeel, Hao Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents. In this section, we introduce our evaluation setup and present the results of pretraining ElasticTok to adaptively represent images and videos, as well as its performance on downstream tasks. |
| Researcher Affiliation | Collaboration | UC Berkeley, Google DeepMind, Carnegie Mellon. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This paper describes work performed at UC Berkeley and is not associated with Amazon. |
| Pseudocode | Yes | Exact details of our model's forward pass are described in Algorithm 1. Algorithm 2 provides more details on the exact inference process. |
| Open Source Code | No | Video examples of using ElasticTok can be found on our website: largeworldmodel.github.io/elastictok. This URL points to a project website hosting examples, not source code; the paper contains no unambiguous statement of a code release. |
| Open Datasets | Yes | We train our long video models using v4-512 TPUs from Google Cloud on the COYO-700M image dataset and a custom dataset consisting of 6M videos scraped from the web. Image Data: We use COYO-700M (Byeon et al., 2022) for our text-image data. We filter out images smaller than 256×256. After accounting for stale links, we are left with roughly 350M text-image pairs. (Reference: Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.) |
| Dataset Splits | No | The paper describes batch splits for training different models (e.g., "Batch Split (Image / Video) 100%/0%"), but does not provide explicit train/validation/test splits for the main COYO-700M or custom video datasets used to pre-train ElasticTok. |
| Hardware Specification | Yes | We train our long video models using v4-512 TPUs from Google Cloud on the COYO-700M image dataset and a custom dataset consisting of 6M videos scraped from the web. We additionally train an image-only model on ImageNet using v4-256 TPUs. |
| Software Dependencies | No | The paper mentions using various models and architectures such as OpenLLaMA-3B, FSQ, VAE, and ViTs, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used for implementation. |
| Experiment Setup | Yes | The tables below show training details for each of our models. Our Long Video model is trained on a mix of images and video (Batch Split), and each run (e.g. Long Video (2)) is initialized from the previous run (e.g. Long Video (1)). For the discrete (FSQ) model, each block has 4K tokens (256 blocks = 1M tokens), and the continuous (VAE) model has 2K tokens in each block (256 blocks = 512K tokens). (The table in Appendix B lists Batch Size, Total Iterations, Learning Rate, Optimizer, Weight Decay, and Warmup Iterations with specific values.) |