Efficient Inference for Large Language Model-based Generative Recommendation
Authors: Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2× speedup under strict top-K verification and up to 2.5× speedup under relaxed sampling verification. We conduct extensive experiments using both verification strategies on two real-world recommendation datasets, demonstrating that AtSpeed significantly accelerates the decoding for LLM-based recommendation (around 2× speedup). |
| Researcher Affiliation | Academia | National University of Singapore; Tsinghua University; University of Science and Technology of China; The Hong Kong Polytechnic University |
| Pseudocode | Yes | Algorithm 1: SD step with Top-K Strict Verification. Algorithm 2: SD step with Relaxed Sampling Verification. |
| Open Source Code | Yes | The codes and datasets are available at https://github.com/Linxyhaha/AtSpeed. |
| Open Datasets | Yes | To evaluate our proposed framework, we instantiate AtSpeed on a SOTA LLM-based generative recommender model LC-Rec (Zheng et al., 2024) and test on two real-world recommendation datasets from the popular benchmark Amazon review datasets (https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/). 1) Beauty contains user interactions with beauty products and 2) Games collects user interactions with video games. |
| Dataset Splits | Yes | For both Beauty and Games, all interactions are sorted according to the global timestamps, and then split into training, validation, and testing sets with the ratio of 8:1:1. |
| Hardware Specification | Yes | Figure 1: (a) The inference time costs of LC-Rec (Zheng et al., 2024) with LLaMA-7B on a single A5000 GPU. We train the draft model for 20 epochs on 4 NVIDIA RTX A5000 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, LLaMA-7B, LLaMA-68M, and the LoRA fine-tuning technique, but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | For draft model training, we use the AdamW optimizer with batch size = 64, learning rate = 0.001, and a cosine scheduler with 200 warmup steps to adjust the learning rate. We train the draft model for 20 epochs on 4 NVIDIA RTX A5000 GPUs. Meanwhile, we search the alignment strength α in {0.1, 0.3, 0.5, 0.7} and weight decay in {0.01, 0.1}. We set draft length γ = 4, number of recommended items K = {1, 3, 5, 10, 20}, and draft beam size N = 40. |
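
The Dataset Splits row describes sorting all interactions by global timestamp and splitting them 8:1:1. A minimal sketch of one way to do this is given below. It is not taken from the authors' repository: the function name `chronological_split`, the `(user_id, item_id, timestamp)` tuple layout, and the floor-based rounding of the split boundaries are all assumptions made for illustration, and the paper excerpt does not say how ties or remainders are handled.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (user_id, item_id, unix_timestamp).
Interaction = Tuple[str, str, int]


def chronological_split(
    interactions: Sequence[Interaction],
    ratios: Tuple[int, int, int] = (8, 1, 1),
) -> Tuple[List[Interaction], List[Interaction], List[Interaction]]:
    """Sort by global timestamp, then cut into train/validation/test parts."""
    # sorted() is stable, so interactions with equal timestamps keep input order.
    ordered = sorted(interactions, key=lambda record: record[2])
    total = sum(ratios)
    n = len(ordered)
    # Assumption: boundaries are rounded down; any remainder goes to the test set.
    train_end = n * ratios[0] // total
    valid_end = n * (ratios[0] + ratios[1]) // total
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


if __name__ == "__main__":
    # Toy data with shuffled timestamps, for illustration only.
    toy = [("u%d" % (t % 3), "i%d" % t, t) for t in (5, 2, 9, 0, 7, 1, 8, 3, 6, 4)]
    train, valid, test = chronological_split(toy)
    print(len(train), len(valid), len(test))
```

Because the split is global rather than per user, every training interaction precedes every validation and test interaction in time, which avoids leaking future behaviour into training.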