Efficient Inference for Large Language Model-based Generative Recommendation

Authors: Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., nearly 2× speedup under strict top-K verification and up to 2.5× speedup under relaxed sampling verification. We conduct extensive experiments using both verification strategies on two real-world recommendation datasets, demonstrating that AtSpeed significantly accelerates decoding for LLM-based recommendation (around 2× speedup).
Researcher Affiliation | Academia | 1 National University of Singapore, 2 Tsinghua University, 3 University of Science and Technology of China, 4 The Hong Kong Polytechnic University
Pseudocode | Yes | Algorithm 1: SD step with Top-K Strict Verification; Algorithm 2: SD step with Relaxed Sampling Verification
Open Source Code | Yes | The code and datasets are available at https://github.com/Linxyhaha/AtSpeed.
Open Datasets | Yes | To evaluate our proposed framework, we instantiate AtSpeed on a SOTA LLM-based generative recommender model, LC-Rec (Zheng et al., 2024), and test on two real-world recommendation datasets from the popular Amazon review benchmark (https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/): 1) Beauty contains user interactions with beauty products, and 2) Games collects user interactions with video games.
Dataset Splits | Yes | For both Beauty and Games, all interactions are sorted according to the global timestamps and then split into training, validation, and testing sets with a ratio of 8:1:1.
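The chronological 8:1:1 split described above can be sketched as follows. This is a minimal illustration, not the paper's preprocessing code; the `timestamp` field name and list-of-dicts layout are assumptions:

```python
def chronological_split(interactions, ratios=(0.8, 0.1, 0.1)):
    """Sort interactions by global timestamp, then slice into
    training/validation/testing portions (8:1:1 in the paper)."""
    ordered = sorted(interactions, key=lambda x: x["timestamp"])
    n = len(ordered)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test
```

Splitting on global timestamps (rather than per user) keeps all evaluation interactions strictly later in time than the training data, avoiding temporal leakage.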
Hardware Specification | Yes | Figure 1(a): the inference time costs of LC-Rec (Zheng et al., 2024) with LLaMA-7B on a single A5000 GPU. We train the draft model for 20 epochs on 4 NVIDIA RTX A5000 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer, LLaMA-7B, LLaMA-68M, and the LoRA fine-tuning technique, but does not provide specific version numbers for software libraries or frameworks such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For draft model training, we use the AdamW optimizer with batch size = 64, learning rate = 0.001, and a cosine scheduler with a warmup of 200 steps to adjust the learning rate. We train the draft model for 20 epochs on 4 NVIDIA RTX A5000 GPUs. Meanwhile, we search the alignment strength α in {0.1, 0.3, 0.5, 0.7} and the weight decay in {0.01, 0.1}. We set draft length γ = 4, number of recommended items K ∈ {1, 3, 5, 10, 20}, and draft beam size N = 40.
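The cosine-with-warmup learning-rate schedule named in the setup can be sketched in plain Python. The base learning rate (0.001) and warmup length (200 steps) come from the paper; `total_steps` is an illustrative assumption, since the paper specifies epochs rather than a total step count:

```python
import math

def lr_at_step(step, base_lr=0.001, warmup_steps=200, total_steps=10_000):
    """Learning rate under a cosine schedule with linear warmup.

    Linearly ramps from 0 to base_lr over warmup_steps, then decays
    to 0 along a half-cosine over the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would typically be handled by a ready-made scheduler (e.g., `transformers.get_cosine_schedule_with_warmup`) wrapped around the AdamW optimizer, rather than computed by hand.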