FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Authors: Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID < 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine visual vocabulary, and that the number of tokens to generate depends on the complexity of the generation task. |
| Researcher Affiliation | Collaboration | 1Apple 2Swiss Federal Institute of Technology Lausanne (EPFL). Correspondence to: Roman Bachmann <EMAIL>, Jesse Allardice <EMAIL>, David Mizrahi <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using text, diagrams (e.g., Figure 3), and mathematical formulations, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper includes a URL (https://flextok.epfl.ch) that appears to be a project page, but it does not state that this URL leads to a source code repository for the described methodology, nor does it provide a direct link to a code repository or state that code is available in supplementary materials. |
| Open Datasets | Yes | On ImageNet, this approach achieves an FID < 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens... We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization... To that end, we train autoregressive Transformers to perform class-conditional generation on ImageNet-1k (Russakovsky et al., 2014), following LlamaGen (Sun et al., 2024), and text-to-image generation on DFN-2B (Fang et al., 2023). |
| Dataset Splits | Yes | We evaluate its reconstruction performance on nested token sequences of different lengths. We perform comparisons using FlexTok models trained on ImageNet-1k, testing them on 256x256 pixel crops from the validation set (Russakovsky et al., 2014)... For evaluation of the class-conditioned image generation results, we follow the common practice of measuring the generation FID (gFID) of 50K generated samples relative to the reference statistics calculated over the entire training split of the ImageNet-1k dataset (Dhariwal & Nichol, 2021). |
| Hardware Specification | Yes | Using the optimal hyperparameters found (learning rate 5e-4, weight decay 0.05, crop scale [0.4, 1.0]), we perform a sweep over the number of register tokens to collect the results shown in Figure 16. These experiments are conducted for all models (FlexTok d12-d12, FlexTok d18-d18, FlexTok d18-d28) with a batch size of 1024, using 8 H100 or A100 GPUs for each experiment. |
| Software Dependencies | No | The paper mentions "FlexAttention (PyTorch Team: Horace He, Driss Guessous, Yanbo Liang, Joy Dong, 2024)" and the data type "bfloat16 (Burgess et al., 2019)", but it does not provide a list of specific software dependencies with version numbers, such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | We break down the implementation into three distinct stages. In Stage 0, we train VAE models (Rombach et al., 2022) with continuous latents... All subsequent experiments use the 16-channel VAE with a downsampling factor of 8... The FlexTok architecture consists of a Transformer encoder and decoder using a maximum of 256 register tokens. After applying a 6-dimensional FSQ (Mentzer et al., 2023) bottleneck with levels [8, 8, 8, 5, 5, 5] (for an effective vocabulary size of 64,000)... All models are trained at a resolution of 256x256 pixels... We train three FlexTok versions with different encoder and decoder sizes (separated by a hyphen), d12-d12, d18-d18 and d18-d28... We use adaLN-zero (Peebles & Xie, 2023) to condition the patches and registers separately on the current timestep, and REPA (Yu et al., 2024b) with DINOv2-L (Oquab et al., 2023) features to speed up convergence... See Table 5 for a detailed breakdown of the resampler tokenizer architecture and training settings. |
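The FSQ bottleneck described in the setup row can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows, under the standard FSQ formulation (Mentzer et al., 2023), how the reported levels [8, 8, 8, 5, 5, 5] yield the stated effective vocabulary of 64,000: each of the 6 latent channels is clamped to [-1, 1] and rounded to one of L_i uniformly spaced values, and the per-channel indices combine in mixed radix to a single token id. The function names here (`fsq_quantize`, `codes_to_token`) are illustrative, not from the paper.

```python
import math

# Per-channel quantization levels reported in the paper.
LEVELS = [8, 8, 8, 5, 5, 5]

def fsq_quantize(z):
    """Map a 6-dim latent to per-channel indices.

    Each channel z_i is clamped to [-1, 1] and rounded to the nearest
    of L_i uniformly spaced values, giving an index in {0, ..., L_i - 1}.
    """
    codes = []
    for z_i, L in zip(z, LEVELS):
        z_i = max(-1.0, min(1.0, z_i))            # bound the latent
        codes.append(round((z_i + 1) / 2 * (L - 1)))  # nearest level index
    return codes

def codes_to_token(codes):
    """Flatten per-channel indices into one token id (mixed-radix encoding)."""
    token = 0
    for c, L in zip(codes, LEVELS):
        token = token * L + c
    return token

# Effective vocabulary size is the product of the per-channel levels.
vocab_size = math.prod(LEVELS)  # 8 * 8 * 8 * 5 * 5 * 5 = 64_000
```

Because the codebook is implicit in the rounding grid, FSQ needs no learned codebook or commitment losses, which is one reason it is attractive as a discrete bottleneck.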