Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Authors: Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments for our proposed SJD on multiple auto-regressive text-to-image generation models, showing the effectiveness of model acceleration without sacrificing the visual quality. We perform quantitative and qualitative experiments to demonstrate the effectiveness of our method. Results show that our method can accelerate several auto-regressive text-to-image generation models without sacrificing the quality of generated images. For example, it can accelerate Anole (Chern et al., 2024) and Lumina-mGPT (Liu et al., 2024b) by about 2× with almost no loss in visual quality. Metrics. For visual quality, we use FID (Heusel et al., 2017) and CLIP-Score (Radford et al., 2021) as the metrics for evaluation. We use the step compression ratio (Fu et al., 2024b), S = (# generated tokens) / (# decoding steps), to show the theoretical acceleration ratio. We perform ablation studies on Lumina-mGPT 7B. |
| Researcher Affiliation | Collaboration | 1The University of Hong Kong 2Huawei Noah's Ark Lab 3CUHK 4Tsinghua University 5Shanghai Jiao Tong University 6Infinigence AI |
| Pseudocode | No | The paper describes the Speculative Jacobi Decoding algorithm in detail using prose and mathematical formulations (e.g., equations for acceptance criteria and resampling), and figures (Fig. 3 and Fig. 4 illustrate the process), but it does not present a distinct section or block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code of our work is available here: https://github.com/tyshiwo1/Accelerating-T2I-AR-with-SJD/. |
| Open Datasets | Yes | The parti-prompts (Yu et al., 2022) and the validation set of MS-COCO 2017 (Lin et al., 2014) are taken as the benchmarks of image generation. On parti-prompts, we use the CLIP-Score and the acceleration of latency and steps excluding FID for evaluation because this benchmark only provides prompts without ground-truth images. We also evaluate our method with the text-to-image LlamaGen (Sun et al., 2024a). This model adopts a two-stage training strategy: (a) stage 1: LlamaGen is first trained on a subset of LAION-COCO (LAION, 2022) (50M 256×256 images); (b) stage 2: it is then fine-tuned on 10M high-aesthetic-quality internal data with a resolution of 512×512. |
| Dataset Splits | Yes | The parti-prompts (Yu et al., 2022) and the validation set of MS-COCO 2017 (Lin et al., 2014) are taken as the benchmarks of image generation. We experiment with two recent and representative auto-regressive text-to-image generation models, Lumina-mGPT (Liu et al., 2024b) and Anole (Chern et al., 2024). |
| Hardware Specification | Yes | The evaluation on the validation set of MS-COCO 2017 is performed with an A100. The evaluation on the validation set of parti-prompts is performed with an RTX 4090. For 768×768 image generation (the number of generated tokens is at least 2357), we perform the experiments on one RTX 4090 GPU. For 1024×1024 image generation (the number of generated tokens is at least 4165), we perform the experiments on one A100 GPU. |
| Software Dependencies | No | The paper does not explicitly list any specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that are needed to replicate the experiment. |
| Experiment Setup | Yes | Following the basic setting of Lumina-mGPT, K is set to 2000 and the classifier-free guidance weight is set to 3.0. We calculate the average step compression ratio for each resolution given the same set of text prompts. We perform the ablation studies on the size of the window. The results show that our acceleration ratio reaches almost the maximum when the number of input tokens is greater than or equal to 16 tokens. We adopt an extreme case, the textual prompt "2D logo of a pure white box in a pure black background", for evaluation. We run the accelerated forward passes ten times with different random seeds for each initialization. |
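The step compression ratio quoted in the Research Type row can be sketched as a one-liner. The token and step counts below are hypothetical, chosen only to illustrate the roughly 2× acceleration the paper reports (the 2357-token count for 768×768 generation is taken from the Hardware Specification row; the step count is an assumed value, not a reported result):

```python
def step_compression_ratio(generated_tokens: int, decoding_steps: int) -> float:
    """Theoretical acceleration ratio S = (# generated tokens) / (# decoding steps).

    Standard auto-regressive decoding emits one token per forward pass (S = 1);
    speculative Jacobi decoding can accept several tokens per pass (S > 1).
    """
    return generated_tokens / decoding_steps

# Hypothetical example: a 768x768 image needs at least 2357 tokens.
# If speculative decoding accepts them in 1100 forward passes, S is about 2.14.
s = step_compression_ratio(2357, 1100)
print(f"S = {s:.2f}")  # -> S = 2.14
```

Note that S measures decoding steps, not wall-clock time; the latency speedup also depends on the per-step cost of verifying multiple draft tokens in parallel.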