A Highly Compressed Tokenizer Can Generate Without Training
Authors: Lukas Lao Beyer, Tianhong Li, Xinlei Chen, Sertac Karaman, Kaiming He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a series of experiments, we demonstrate that simple latent space manipulations of tokens can result in image editing capabilities typically associated with generative models. For quantitative evaluation of editing and generation quality, we will consider a class-conditional generation pipeline based on a small seed image dataset subsampled from the ImageNet training data, along with a set of CLIP text prompts used to guide generation towards target classes. |
| Researcher Affiliation | Collaboration | 1MIT LIDS, 2MIT CSAIL, 3Meta FAIR. Correspondence to: Lukas Lao Beyer <EMAIL>. |
| Pseudocode | Yes | Algorithm A1: Test-Time Optimization for CLIP-Guided Latent Editing. Input: img, the seed image, and prompt, a text prompt. Output: recons, the optimized image. Algorithm A2: Test-Time Optimization with Optional Tweaks. Input: img, the seed image, and ℓ, an objective function taking an image. Output: recons, the optimized image. |
| Open Source Code | Yes | Code is available at https://github.com/lukaslaobeyer/token-opt. |
| Open Datasets | Yes | For quantitative evaluation of editing and generation quality, we will consider a class-conditional generation pipeline based on a small seed image dataset subsampled from the ImageNet training data, along with a set of CLIP text prompts used to guide generation towards target classes. A fixed number of ImageNet ILSVRC2012 (Deng et al., 2009) training set images are randomly selected. |
| Dataset Splits | Yes | A fixed number of ImageNet ILSVRC2012 (Deng et al., 2009) training set images are randomly selected. For each ImageNet class, an equal number of images is sampled at random without replacement. The 1D tokens for images from this small seed image dataset are used to initialize the test-time token optimization. We generate 50k prompts, distributed according to the ImageNet validation set class statistics. |
| Hardware Specification | Yes | Running 300 iterations of the text-guided image editing optimization (with CLIP loss smoothed over 8 random crops) in half precision using the VQ-LL-32 tokenizer takes 7 seconds per image on an NVIDIA A100. |
| Software Dependencies | No | The paper mentions a 'PyTorch implementation' but does not specify version numbers for PyTorch or other key software dependencies. |
| Experiment Setup | Yes | In practice, we use the Adam optimizer with a learning rate of 0.1, β1 = 0.9 and β2 = 0.999. We use a cosine schedule to ramp the noise from σ₁² = 0.3 to σ₂₀₀² = 0. Token regularization is highlighted in green. We obtain best results with λ = 0.02. Token EMA (not shown) uses a decay factor of 0.98. In our experiments, we set σinit = 0.3 as chosen for best performance from a sweep including {0.05, 0.3, 1.0}. |
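
The experiment setup quoted above describes a cosine schedule that ramps the noise variance from σ₁² = 0.3 at the first iteration down to 0 by iteration 200. The paper excerpt does not give the exact functional form, so the sketch below assumes a standard half-cosine ramp between those two endpoints; the function name and signature are illustrative, not from the paper's code.

```python
import math

def noise_variance(t: int, total: int = 200, sigma_sq_init: float = 0.3) -> float:
    """Cosine ramp of the noise variance: sigma_sq_init at t=1, 0 at t=total.

    Assumes a standard half-cosine decay; the paper only states the
    endpoints (0.3 -> 0 over 200 iterations) and that the schedule is cosine.
    """
    frac = (t - 1) / (total - 1)  # 0 at t=1, 1 at t=total
    return sigma_sq_init * 0.5 * (1.0 + math.cos(math.pi * frac))

# Per-iteration variances for the 200-step ramp used during token optimization.
schedule = [noise_variance(t) for t in range(1, 201)]
print(round(schedule[0], 4), round(schedule[-1], 4))  # 0.3 0.0
```

Such a schedule would be evaluated once per optimization step, with the resulting variance scaling the noise injected into the tokens before each Adam update.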