ImageFolder: Autoregressive Image Generation with Folded Tokens
Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate the superior quality of image generation and shorter token length with ImageFolder tokenizer. We test our ImageFolder tokenizer on the ImageNet 256x256 reconstruction and generation tasks." (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | Carnegie Mellon University¹, Adobe Research², MBZUAI³ |
| Pseudocode | No | The paper describes methods in prose and through architectural diagrams (e.g., Figure 1, Figure 3, Figure 5), but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Project Page: ImageFolder.github.io |
| Open Datasets | Yes | We test our ImageFolder tokenizer on the ImageNet 256x256 reconstruction and generation tasks. The ImageNet dataset (Deng et al., 2009) is a large-scale visual database designed for use in visual object recognition research. |
| Dataset Splits | Yes | Its training set contains approximately 1.28 million images spanning 1,000 classes. Its validation set contains 50,000 images, with 50 images per class across the same 1,000 classes. |
| Hardware Specification | Yes | "Time (s): 8.851, 0.134, 0.130 [...] on a single A100 GPU." |
| Software Dependencies | No | The paper mentions using models like DINOv2 and GPT-2-based architectures, but does not provide specific version numbers for software libraries, programming languages, or other dependencies required for reproduction. |
| Experiment Setup | Yes | We use a cosine learning rate scheduler with a warmup for 1 epoch and a start learning rate of 3e-5. We set the quantizer drop ratio to 0.1. We set λ_clip = 0.1, λ_recon = λ_VQ = λ_P = 1, and λ_ad = 0.5. We set the residual quantizer scales to [1, 1, 2, 3, 3, 4, 5, 6, 8, 11] (in total 286 tokens). The codebook size for each tokenizer is set to 4096. |
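The reported token count can be checked against the quantizer scales. This is a hedged sketch, assuming each scale value k denotes a k×k token grid at that residual level (a common convention in multi-scale residual quantization); under that assumption the total token length is the sum of squared scales:

```python
# Assumption: each residual quantizer scale k corresponds to a k x k token grid.
scales = [1, 1, 2, 3, 3, 4, 5, 6, 8, 11]

# Total tokens = sum of k^2 over all scales.
total_tokens = sum(k * k for k in scales)
print(total_tokens)  # 286, matching the paper's reported token count
```

The fact that 1² + 1² + 2² + 3² + 3² + 4² + 5² + 6² + 8² + 11² = 286 matches the paper's stated "286 tokens" supports this reading of the scale list, though the paper excerpt does not state the convention explicitly.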