ElasticTok: Adaptive Tokenization for Image and Video

Authors: Wilson Yan, Volodymyr Mnih, Aleksandra Faust, Matei Zaharia, Pieter Abbeel, Hao Liu

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents. In this section, we introduce our evaluation setup and present the results of pretraining ElasticTok to adaptively represent images and videos, as well as its performance on downstream tasks.
Researcher Affiliation | Collaboration | UC Berkeley, Google DeepMind, Carnegie Mellon. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This paper describes work performed at UC Berkeley and is not associated with Amazon.
Pseudocode | Yes | Exact details of our model's forward pass are described in Algorithm 1. Algorithm 2 provides more details on the exact inference process.
Open Source Code | No | Video examples of using ElasticTok can be found on our website: largeworldmodel.github.io/elastictok. This URL points to a project website for examples, not explicitly for source code. The paper does not contain an unambiguous statement of code release.
Open Datasets | Yes | We train our long video models using v4-512 TPUs from Google Cloud on the COYO-700M image dataset and a custom dataset consisting of 6M videos scraped from the web. Image Data: We use COYO-700M (Byeon et al., 2022) for our text-image data. We filter out images smaller than 256×256. After accounting for stale links, we are left with roughly 350M text-image pairs. (Reference: Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.)
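The resolution filter quoted above can be sketched as a simple predicate; the paper only states the 256×256 threshold, so the exact rule (minimum side length) and the helper name below are our assumptions:

```python
# Hedged sketch of the image-resolution filter described in the quote:
# keep only images whose smaller side is at least 256 pixels.
# MIN_SIDE and keep_image are illustrative names, not from the paper.
MIN_SIDE = 256

def keep_image(width: int, height: int) -> bool:
    """Return True if the image meets the minimum-resolution threshold."""
    return min(width, height) >= MIN_SIDE

# Example: filter a list of (width, height) pairs.
sizes = [(512, 384), (200, 800), (256, 256)]
kept = [s for s in sizes if keep_image(*s)]
print(kept)  # [(512, 384), (256, 256)]
```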
Dataset Splits | No | The paper describes batch splits for training different models (e.g., "Batch Split (Image / Video) 100%/0%"), but does not provide explicit train/validation/test splits for the main COYO-700M or custom video datasets used for pre-training ElasticTok.
Hardware Specification | Yes | We train our long video models using v4-512 TPUs from Google Cloud on the COYO-700M image dataset and a custom dataset consisting of 6M videos scraped from the web. We additionally train an image-only model on ImageNet using v4-256 TPUs.
Software Dependencies | No | The paper mentions using various models and architectures like OpenLLaMA-3B, FSQ, VAE, and ViTs, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used for implementation.
Experiment Setup | Yes | The tables below show training details for each of our models. Our Long Video model is trained on a mix of images and video (Batch Split), and each run (e.g., Long Video (2)) is initialized from the previous run (e.g., Long Video (1)). For the discrete (FSQ) model, each block has 4k tokens (256 blocks = 1M tokens), and the continuous (VAE) model has 2k tokens in each block (256 blocks = 512K tokens). (The table in Appendix B lists Batch Size, Total Iterations, Learning Rate, Optimizer, Weight Decay, and Warmup Iterations with specific values.)
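The per-block budgets quoted above imply the totals in parentheses (4k × 256 ≈ 1M; 2k × 256 ≈ 512K). A minimal arithmetic sketch, assuming "4k" means 4096 tokens and "2k" means 2048 (the helper name is ours, not from the paper):

```python
# Sketch of the token-budget arithmetic from the quoted setup.
# Block/token counts come from the paper; total_tokens is an illustrative helper.
def total_tokens(tokens_per_block: int, num_blocks: int) -> int:
    """Total sequence length when every block uses its full token budget."""
    return tokens_per_block * num_blocks

# Discrete (FSQ) model: 4096 tokens per block, 256 blocks.
fsq_total = total_tokens(4096, 256)  # 1_048_576, i.e. "1M" tokens
# Continuous (VAE) model: 2048 tokens per block, 256 blocks.
vae_total = total_tokens(2048, 256)  # 524_288, i.e. "512K" tokens
print(fsq_total, vae_total)
```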