Highly Compressed Tokenizer Can Generate Without Training

Authors: Lukas Lao Beyer, Tianhong Li, Xinlei Chen, Sertac Karaman, Kaiming He

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Through a series of experiments, we demonstrate that simple latent space manipulations of tokens can result in image editing capabilities typically associated with generative models. For quantitative evaluation of editing and generation quality, we will consider a class-conditional generation pipeline based on a small seed image dataset subsampled from the ImageNet training data, along with a set of CLIP text prompts used to guide generation towards target classes."
Researcher Affiliation: Collaboration. "¹MIT LIDS, ²MIT CSAIL, ³Meta FAIR. Correspondence to: Lukas Lao Beyer <EMAIL>."
Pseudocode: Yes. "Algorithm A1: Test-Time Optimization for CLIP-Guided Latent Editing. Input: img, the seed image, and prompt, a text prompt. Output: recons, the optimized image." "Algorithm A2: Test-Time Optimization with Optional Tweaks. Input: img, the seed image, and ℓ, an objective function taking an image. Output: recons, the optimized image."
Open Source Code: Yes. "Code is available at https://github.com/lukaslaobeyer/token-opt."
Open Datasets: Yes. "For quantitative evaluation of editing and generation quality, we will consider a class-conditional generation pipeline based on a small seed image dataset subsampled from the ImageNet training data, along with a set of CLIP text prompts used to guide generation towards target classes. A fixed number of ImageNet ILSVRC2012 (Deng et al., 2009) training set images are randomly selected."
Dataset Splits: Yes. "A fixed number of ImageNet ILSVRC2012 (Deng et al., 2009) training set images are randomly selected. For each ImageNet class, an equal number of images is sampled at random without replacement. The 1D tokens for images from this small seed image dataset are used to initialize the test-time token optimization. We generate 50k prompts, distributed according to the ImageNet validation set class statistics."
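The sampling procedure quoted above can be sketched with the standard library. This is a hedged illustration: the class names, image paths, and prompt templates below are hypothetical, and the paper does not specify how prompts are phrased, only that their class distribution follows the ImageNet validation set statistics.

```python
import random
from collections import Counter

def sample_seed_images(paths_by_class, per_class, rng):
    """Sample an equal number of images per class, without replacement."""
    seed = []
    for cls, paths in paths_by_class.items():
        seed.extend(rng.sample(paths, per_class))  # no replacement within a class
    return seed

def sample_prompts(val_labels, templates, n_prompts, rng):
    """Draw prompts whose class frequencies match the validation-set labels."""
    counts = Counter(val_labels)
    classes = list(counts)
    weights = [counts[c] for c in classes]
    drawn = rng.choices(classes, weights=weights, k=n_prompts)
    return [templates[c] for c in drawn]

rng = random.Random(0)
# Hypothetical two-class stand-in for the 1000 ImageNet classes.
images = {"tabby cat": [f"cat_{i}.jpg" for i in range(10)],
          "golden retriever": [f"dog_{i}.jpg" for i in range(10)]}
seed = sample_seed_images(images, per_class=2, rng=rng)

templates = {c: f"a photo of a {c}" for c in images}  # assumed prompt format
val_labels = ["tabby cat"] * 50 + ["golden retriever"] * 50
prompts = sample_prompts(val_labels, templates, n_prompts=8, rng=rng)
```

In the paper's setting, `n_prompts` would be 50k and the seed images' 1D tokens would initialize the test-time optimization.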
Hardware Specification: Yes. "Running 300 iterations of the text-guided image editing optimization (with CLIP loss smoothed over 8 random crops) in half precision using the VQ-LL-32 tokenizer takes 7 seconds per image on an NVIDIA A100."
Software Dependencies: No. The paper mentions a "PyTorch implementation" but does not specify version numbers for PyTorch or other key software dependencies.
Experiment Setup: Yes. "In practice, we use the Adam optimizer with a learning rate of 0.1, β₁ = 0.9 and β₂ = 0.999. We use a cosine schedule to ramp the noise from σ₁² = 0.3 to σ₂₀₀² = 0. Token regularization is highlighted in green. We obtain best results with λ = 0.02. Token EMA (not shown) uses a decay factor of 0.98. In our experiments, we set σ_init = 0.3 as chosen for best performance from a sweep including {0.05, 0.3, 1.0}."
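The schedule and regularization hyperparameters quoted above can be sketched as follows. Assumptions are labeled in the comments: the quote only says the noise variance ramps along a cosine from σ₁² = 0.3 at step 1 to σ₂₀₀² = 0 at step 200, so the exact interpolation formula is an assumption, and the L2 form of the token regularizer is a stand-in since its exact form is not quoted here.

```python
import math

def cosine_noise_schedule(step, total=200, sigma2_start=0.3, sigma2_end=0.0):
    """Cosine ramp of the noise variance over `total` steps.

    Assumed form: endpoints match the quoted sigma^2_1 = 0.3 and
    sigma^2_200 = 0, with a half-cosine in between.
    """
    frac = (step - 1) / (total - 1)           # 0 at step 1, 1 at step `total`
    w = 0.5 * (1 + math.cos(math.pi * frac))  # decays from 1 to 0
    return sigma2_end + (sigma2_start - sigma2_end) * w

class TokenEMA:
    """Exponential moving average of the tokens (quoted decay: 0.98)."""
    def __init__(self, decay=0.98):
        self.decay = decay
        self.value = None
    def update(self, tokens):
        if self.value is None:
            self.value = list(tokens)
        else:
            self.value = [self.decay * v + (1 - self.decay) * t
                          for v, t in zip(self.value, tokens)]
        return self.value

LAMBDA = 0.02  # quoted weight of the token regularization term

def regularized_loss(clip_loss, tokens):
    """Total objective: CLIP loss plus a lambda-weighted L2 token penalty
    (the L2 form is an assumption, not quoted from the paper)."""
    return clip_loss + LAMBDA * sum(t * t for t in tokens)

ema = TokenEMA()
ema.update([1.0, 2.0])
smoothed = ema.update([0.0, 0.0])
```

Under this sketch, the noise injected into the tokens starts at variance 0.3 and is fully annealed away by step 200 of the optimization.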