Adaptive Length Image Tokenization via Recurrent Allocation

Authors: Shivam Duggal, Phillip Isola, Antonio Torralba, William Freeman

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.
Researcher Affiliation | Academia | Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman — MIT CSAIL
Pseudocode | No | The paper describes the architecture and process in text and diagrams (Figure 1 and Figure 10), but does not include any explicitly labeled pseudocode blocks or algorithms.
Open Source Code | Yes | Code available at https://github.com/ShivamDuggal4/adaptive-length-tokenizer.
Open Datasets | Yes | We validate the effectiveness of the learned tokenizer by demonstrating comparable reconstruction metrics (L1 loss and FID) on multiple datasets (IN, COCO, Places, Art-dataset and even randomly selected internet images, Fig. 20) and linear probing results on ImageNet-1K, relative to the 2D VQGAN tokenizer (Esser et al., 2020) and the fixed-latent 1D tokenizer, Titok (Yu et al., 2024). For training the adaptive tokenizer, we mainly utilize ImageNet-100 (100 classes of ImageNet-1K as used in (Wang & Isola, 2022)) and ImageNet-1K datasets. Fig. 3, Fig. 11, Fig. 12 all leverage data from the SAVOIAS dataset, which again is different from the ImageNet-100 classes. Plots in Fig. 4, Fig. 5, Fig. 14, Fig. 13 all leverage the ImageNet-100 validation set. Fig. 7 and Fig. 8 use ImageNet images for token visualization; Fig. 15 and Fig. 16 use COCO images. Fig. 17 and Fig. 18 showcase randomly sampled images from the ImageNet-100 validation set. The OOD indoor scene image is from the NYUv2 dataset (Nathan Silberman & Fergus, 2012) and the tree example is from the WIT dataset (Srinivasan et al., 2021).
Dataset Splits | Yes | For training the adaptive tokenizer, we mainly utilize ImageNet-100 (100 classes of ImageNet-1K as used in (Wang & Isola, 2022)) and ImageNet-1K datasets. Fig. 4, Fig. 5, Fig. 14, Fig. 13 all leverage the ImageNet-100 validation set. We visualize the frequency with which each code in the learned codebook is sampled across 5K validation images in Fig. 22.
Hardware Specification | Yes | Table 8: Iteration-wise Inference Time (ms) Comparison on a single H100 GPU with FP32 precision for single-image encoding.
Software Dependencies | No | The paper mentions various models and frameworks like VQGAN, VAE, ResNet-18, Depth Anything V2, VGG16, AlexNet, and the BLIP model, but does not specify exact version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup | Yes | Training Details: In this section, we provide additional training details (see Sec. 3 for approach details and a training procedure summary). We train ALIT in two phases: latent-distillation pre-training and a full fine-tuning stage (with GAN loss). In the latent-distillation pre-training stage, we leverage a pre-trained image tokenizer (VQGAN or VAE) which maps an input image to 2D tokens. We only train the latent-distillation encoder-decoder modules in this stage, using image token reconstruction loss as the core learning objective. With VQGAN as the base tokenizer, we use a cross-entropy loss comparing predicted logits with the ground-truth VQGAN-codebook index at each 2D token position. We use a mean-square reconstruction loss when using VAE as the base tokenizer. We unroll the recurrent token allocation procedure for 8 iterations, expanding token memory from 32 (in the 1st iteration) to 256 (in the 8th) during training. All the recurrent rollouts are trained end-to-end. At each iteration, we process the image tokens and the existing 1D latent tokens, and add new latent tokens. During this training phase, we perform dynamic halting of the image tokens in each iteration, allowing the latent-distillation modules to focus on distilling image tokens which cannot be reconstructed perfectly up to the current iteration. We use transformer backbones for both the latent-distillation encoder and decoder, performing self-attention among 2D image tokens and latent 1D tokens. In the next training phase, we jointly fine-tune both the base image tokenizer modules and the latent-distillation encoder-decoder modules with losses applied directly in pixel space. The training objectives are pixel-level reconstruction and adversarial losses (GAN generator and discriminator losses) inspired by the VQGAN (Esser et al., 2020) training procedure. We optimize for reconstruction loss for the first few epochs, later switching to both reconstruction and adversarial losses.
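The recurrent allocation schedule quoted above (a fresh block of 32 latent tokens per iteration, up to 8 iterations, with dynamic halting once reconstruction is good enough) can be sketched in a few lines. This is a toy illustration only: the function names and the stand-in error model are hypothetical, and the real ALIT modules are learned transformer encoder-decoders, not closed-form formulas.

```python
# Toy sketch of ALIT-style recurrent token allocation (hypothetical names;
# the actual method uses learned transformer encoder/decoder modules).

TOKENS_PER_ITER = 32   # new 1D latent tokens added each iteration (32 -> 256)
NUM_ITERS = 8          # number of unrolled recurrent iterations

def reconstruction_error(num_latents, image_entropy):
    # Stand-in for the decoder's reconstruction loss: in this toy model the
    # error shrinks as more latent tokens are allocated to the image.
    return image_entropy / num_latents

def adaptive_tokenize(image_entropy, error_threshold=0.05):
    """Allocate latent tokens until reconstruction is good enough or the
    maximum budget (NUM_ITERS * TOKENS_PER_ITER = 256 tokens) is reached."""
    num_latents = 0
    for _ in range(NUM_ITERS):
        num_latents += TOKENS_PER_ITER   # add a fresh block of latent tokens
        err = reconstruction_error(num_latents, image_entropy)
        if err < error_threshold:        # dynamic halting criterion
            break
    return num_latents

# Higher-entropy (more complex) images receive more tokens:
print(adaptive_tokenize(image_entropy=1.0))   # -> 32
print(adaptive_tokenize(image_entropy=10.0))  # -> 224
```

The point of the sketch is the adaptive-length behavior the report describes: simple images halt after the first 32-token iteration, while complex images keep accumulating tokens up to the 256-token cap.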