LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
Authors: Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. Specifically, compared to a naïve application of state-of-the-art speculative decoding, LANTERN increases speed-ups by 1.75× and 1.82× over greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model. |
| Researcher Affiliation | Collaboration | 1KAIST, 2Intel Labs, 3AITRICS |
| Pseudocode | Yes | Appendix D (Algorithms): D.1 Speculative Decoding with LANTERN (Algorithm 1: LANTERN); D.2 Proximity Set Construction (Algorithm 2: Proximity Set Construction for LANTERN) |
| Open Source Code | Yes | The code is publicly available at https://github.com/jadohu/LANTERN. |
| Open Datasets | Yes | We utilize the MS-COCO validation captions (Lin et al., 2014) to generate images and evaluate the image quality against the ground-truth images. To train the text-conditional model's drafter, we sampled 100k images from the LAION-COCO dataset (Schuhmann et al., 2022), which is used to train the Stage I target model. We used the same number of images sampled from the ImageNet (Deng et al., 2009) dataset to train the class-conditional model's drafter. |
| Dataset Splits | Yes | We utilize the MS-COCO validation captions (Lin et al., 2014) to generate images... For the assessment of speed-ups, we use 1000 MS-COCO validation captions... During training, 5% of the data is held out as a validation dataset. |
| Hardware Specification | Yes | The actual speed-up is measured on a single RTX 3090. (Table 2) The actual speed-ups are measured on a single Intel Gaudi 2 (96GB) accelerator and an NVIDIA RTX 3090. |
| Software Dependencies | No | The paper mentions specific optimizers and models (e.g., AdamW (Loshchilov & Hutter, 2019), Flan-T5 XL (Chung et al., 2022)) but does not provide version numbers for these or for any general programming languages or deep learning frameworks. |
| Experiment Setup | Yes | The batch size is 16, and the base learning rate is 1e-4. The AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95 is used, with linear learning rate scheduling and 2000 warm-up steps. We select the best-performing model in terms of top-3 accuracy on the held-out validation set over 20 epochs. For LlamaGen (Sun et al., 2024) stage I and stage II, images are generated using a classifier-free guidance scale of 7.5 with top-p set to 1.0 and top-k set to 1000... For class-conditional generation, the classifier-free guidance scale is set to 4.0, with top-k sampling covering the entire vocabulary and top-p set to 1.0. For Anole (Chern et al., 2024), we use a classifier-free guidance scale of 3.0 with top-k set to 2000. For EAGLE-2 and our method, 60 candidate tokens are passed into the target model for each verification process. |
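The relaxed acceptance test at the heart of LANTERN (Algorithm 1 in the paper's Appendix D) can be illustrated with a minimal sketch. This is a hypothetical re-implementation, not the authors' released code: the function name `relaxed_accept`, the `proximity` dictionary, and the use of `delta` as a simple cap on the aggregated probability are all assumptions made for illustration; the paper's exact relaxation bound may differ. The idea it conveys is that, instead of comparing target and drafter probability mass only at the single drafted token, the target probability is aggregated over a proximity set of nearby latent tokens before the standard speculative-decoding acceptance test.

```python
import numpy as np

def relaxed_accept(draft_token, q_draft, p_target, proximity, delta=1.0, rng=None):
    """Relaxed speculative-decoding acceptance in the spirit of LANTERN.

    draft_token : token id proposed by the drafter
    q_draft     : drafter's probability distribution (1-D array over vocab)
    p_target    : target model's probability distribution (1-D array over vocab)
    proximity   : dict mapping token id -> array of nearby token ids
                  (the precomputed proximity set, Algorithm 2 in the paper)
    delta       : hypothetical cap limiting how much probability mass the
                  relaxation may borrow, relative to p_target[draft_token]
    """
    rng = rng or np.random.default_rng()
    neighbours = proximity[draft_token]
    # Aggregate target probability over the proximity set, capped by delta.
    p_agg = min(p_target[neighbours].sum(), delta * p_target[draft_token])
    # Standard speculative acceptance test with the aggregated probability.
    accept_prob = min(1.0, p_agg / q_draft[draft_token])
    return bool(rng.random() < accept_prob)
```

With a large `delta` the relaxation is most permissive: a draft token whose proximity set carries substantial target mass is accepted even when the token itself is unlikely under the target, which is what yields the extra speed-up over naïve speculative decoding.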
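The drafter-training recipe in the table (base learning rate 1e-4, linear learning rate scheduling with 2000 warm-up steps) can be sketched as a plain schedule function. This is a sketch under stated assumptions: the paper reports 20 epochs rather than a total step count, so `total_steps` here is a hypothetical placeholder, and linear decay to zero after warm-up is an assumed (common) choice the paper does not spell out.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warm-up followed by linear decay, matching the reported recipe
    (base lr 1e-4, 2000 warm-up steps). The optimizer in the paper is AdamW
    with beta1 = 0.9 and beta2 = 0.95; total_steps and decay-to-zero are
    assumptions for illustration.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr toward 0 over the remaining steps.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)
```

In a training loop this would be applied per step, e.g. by setting each parameter group's learning rate to `lr_at_step(step)` before the optimizer update.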