ImageFolder: Autoregressive Image Generation with Folded Tokens

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate the superior quality of image generation and shorter token length with the ImageFolder tokenizer. We test our ImageFolder tokenizer on the ImageNet 256x256 reconstruction and generation tasks." (Section 4, Experiments)
Researcher Affiliation | Collaboration | Carnegie Mellon University, Adobe Research, MBZUAI
Pseudocode | No | The paper describes methods in prose and through architectural diagrams (e.g., Figure 1, Figure 3, Figure 5), but contains no explicitly labeled "Pseudocode" or "Algorithm" blocks, nor structured steps formatted like code.
Open Source Code | Yes | Project page: ImageFolder.github.io
Open Datasets | Yes | "We test our ImageFolder tokenizer on the ImageNet 256x256 reconstruction and generation tasks." The ImageNet dataset (Deng et al., 2009) is a large-scale visual database designed for use in visual object recognition research.
Dataset Splits | Yes | The training set contains approximately 1.28 million images spanning 1,000 classes; the validation set contains 50,000 images, with 50 images per class across the same 1,000 classes.
Hardware Specification | Yes | "Time (s) 8.851 0.134 0.130 [...] on single A100 GPU."
Software Dependencies | No | The paper mentions models such as DINOv2 and GPT-2-based architectures, but does not provide version numbers for software libraries, programming languages, or other dependencies required for reproduction.
Experiment Setup | Yes | "We use a cosine learning rate scheduler with a warmup for 1 epoch and a start learning rate of 3e-5. We set the quantizer drop ratio to 0.1. We set λclip = 0.1, λrecon = λVQ = λP = 1 and λad = 0.5. We set the residual quantizer scales to [1, 1, 2, 3, 3, 4, 5, 6, 8, 11] (in total 286 tokens). The codebook size for each tokenizer is set to 4096."
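The Experiment Setup row can be sanity-checked numerically. The reported residual quantizer scales [1, 1, 2, 3, 3, 4, 5, 6, 8, 11] match the stated 286 tokens if each scale s contributes an s×s grid of tokens per residual level (a plausible reading of the paper's scale list, not a quote from it). The sketch below, with illustrative function names of our own choosing, also shows one common interpretation of a cosine learning-rate schedule with a 1-epoch linear warmup and a 3e-5 base rate:

```python
import math

# Residual quantizer scales reported in the paper. Assumption: each scale s
# contributes an s x s token grid, so the total is the sum of squares.
SCALES = [1, 1, 2, 3, 3, 4, 5, 6, 8, 11]

def total_tokens(scales):
    """Total token count across all residual levels (sum of s^2)."""
    return sum(s * s for s in scales)

def cosine_lr(epoch, total_epochs, warmup_epochs=1, base_lr=3e-5):
    """Illustrative cosine schedule: linear warmup, then cosine decay to 0.

    The paper reports the warmup length and starting learning rate; this
    exact functional form is an assumption, not the authors' code.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

print(total_tokens(SCALES))  # 286, matching the paper's stated token length
```

Under this reading, 1 + 1 + 4 + 9 + 9 + 16 + 25 + 36 + 64 + 121 = 286, which agrees with the quoted "(in total 286 tokens)".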