SLMRec: Distilling Large Language Models into Small for Sequential Recommendation
Authors: Wujiang Xu, Qitian Wu, Zujie Liang, Jiaojiao Han, Xuying Ning, Yunxiao Shi, Wenfang Lin, Yongfeng Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental results illustrate that the proposed SLMRec model attains the best performance using only 13% of the parameters found in LLM-based recommendation models, while simultaneously achieving up to 6.6x and 8.0x speedups in training and inference time costs, respectively. Besides, we provide a theoretical justification for why small language models can perform comparably to large language models in SR. |
| Researcher Affiliation | Collaboration | Wujiang Xu (1), Qitian Wu (2), Zujie Liang (3), Jiaojiao Han (4), Xuying Ning (5), Yunxiao Shi (6), Wenfang Lin (3), Yongfeng Zhang (1). Affiliations: (1) Rutgers University; (2) Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard; (3) Ant Group; (4) Dian Diagnostics Group Co.; (5) University of Illinois Urbana-Champaign; (6) University of Technology Sydney |
| Pseudocode | No | The paper describes its methodology through mathematical equations and textual explanations, but it does not contain a distinct section, figure, or block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The source code and datasets are available at https://github.com/WujiangXu/SLMRec |
| Open Datasets | Yes | To obtain large-scale industry data, we use the Amazon 2018-version dataset (https://nijianmo.github.io/amazon/index.html) in this paper. More details are shown in Section 5. |
| Dataset Splits | Yes | The historical sequence of interactions for each user is divided into three segments: (1) the most recent interaction is reserved for testing, (2) the second most recent for validation, and (3) all preceding interactions are used for training. [...] In order to ensure an unbiased evaluation, we adopt the methodology employed in previous works (Krichene & Rendle, 2020; Zhao et al., 2020), wherein we randomly select 999 negative items (i.e., items that the user has not interacted with) and combine them with 1 positive item (i.e., a ground-truth interaction) to form our recommendation candidates for the ranking test. |
| Hardware Specification | Yes | We use mixed precision training and train on a single 80GB NVIDIA A100 GPU. |
| Software Dependencies | No | The paper states "Our implementation is based on Huggingface Transformers" but does not specify a version number for this or any other software dependency. |
| Experiment Setup | Yes | In Table 6, we provide hyper-parameters in our training stage. Our implementation is based on Huggingface Transformers. The input and intermediate hidden dimension in the feed-forward network is 4096. We use mixed precision training and train on a single 80GB NVIDIA A100 GPU. |
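The dataset-split protocol quoted above (leave-one-out split per user, plus 999 sampled negatives combined with the one ground-truth item for the ranking test) can be sketched as follows. This is a minimal illustration of the evaluation protocol described in the paper, not code from the SLMRec repository; the function names and the `seed` parameter are hypothetical.

```python
import random

def leave_one_out_split(user_seq):
    """Split one user's chronological interaction sequence:
    the most recent item is reserved for testing, the second most
    recent for validation, and all preceding items for training."""
    assert len(user_seq) >= 3, "need at least 3 interactions per user"
    return user_seq[:-2], user_seq[-2], user_seq[-1]

def build_ranking_candidates(positive, interacted, all_items, n_neg=999, seed=0):
    """Form the candidate set for the ranking test: 1 ground-truth
    item plus n_neg sampled negatives, i.e. items the user has never
    interacted with (sampled-metrics setup of Krichene & Rendle, 2020)."""
    rng = random.Random(seed)
    pool = [i for i in all_items if i not in interacted]
    negatives = rng.sample(pool, n_neg)
    return [positive] + negatives
```

A model is then asked to rank the 1,000 candidates, and metrics such as HR@K / NDCG@K check where the positive item lands.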