Preference Diffusion for Recommendation
Authors: Shuo Liu, An Zhang, Guoqing Hu, Hong Qian, Tat-Seng Chua
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across six benchmarks validate PreferDiff's superior recommendation performance. Our codes are available at https://github.com/lswhim/PreferDiff. 1 INTRODUCTION The recommender system endeavors to model the user preference distribution based on their historical behaviour data (He & McAuley, 2016; Wang et al., 2019; Rendle, 2022) and predict personalized item rankings. Recently, diffusion models (DMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Yang et al., 2024) have gained considerable attention for their robust capacity to model complex data distributions and versatility across a wide range of applications, encompassing diverse input styles: texts (Li et al., 2022; Lovelace et al., 2023), images (Dhariwal & Nichol, 2021; Ho & Salimans, 2022) and videos (Ho et al., 2022a;b). As a result, there has been growing interest in employing DMs as recommenders in recommender systems. 4 EXPERIMENTS In this section, we aim to answer the following research questions: RQ1: How does PreferDiff perform compared with other sequential recommenders? RQ2: Can PreferDiff leverage pretraining to achieve commendable zero-shot performance on unseen datasets or datasets from other platforms, just like DMs in other fields? RQ3: What is the impact of factors (e.g., λ) on PreferDiff's performance? 4.1 PERFORMANCE OF SEQUENTIAL RECOMMENDATION Baselines. We comprehensively compare PreferDiff with five categories of sequential recommenders: traditional sequential recommenders, including GRU4Rec (Hidasi et al., 2016), SASRec (Kang & McAuley, 2018), and BERT4Rec (Sun et al., 2019); contrastive learning-based recommenders, such as CL4SRec (Xie et al., 2022); generative sequential recommenders like TIGER (Rajput et al., 2023); DM-based recommenders, including DiffRec (Wang et al., 2023b), DreamRec (Yang et al., 2023b) and DiffuRec (Li et al., 2024); and text-based recommenders like MoRec (Yuan et al., 2023) and LLM2Bert4Rec (Harte et al., 2023). See Appendix D.3 for details on the introduction, selection, and hyperparameters of the baselines. Datasets. We evaluate the proposed PreferDiff on six public real-world benchmarks (i.e., Sports, Beauty, and Toys from Amazon Reviews 2014 (He & McAuley, 2016), Steam, ML-1M, and Yahoo!R1). |
| Researcher Affiliation | Academia | Shuo Liu1,2, An Zhang2, Guoqing Hu3, Hong Qian1, Tat-Seng Chua2. 1East China Normal University, China; 2National University of Singapore, Singapore; 3University of Science and Technology of China, China. EMAIL, EMAIL, hl15671953077@ustc.mail.edu.cn, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 (Training Phase of PreferDiff). 1: Input: trainable parameters θ, training dataset D_train = {(e_0^+, c, H)}_{n=1}^{|D_train|}, total steps T, unconditional probability p_u, learning rate η, variance schedule {ᾱ_t}_{t=1}^T. 2: Output: updated parameters θ. 3: repeat 4: (e_0^+, c, H) ~ D_train // sample data from the training dataset. 5: with probability p_u: c = Φ // set the unconditional condition with probability p_u. 6: t ~ Uniform(1, T), ε^+, ε^− ~ N(0, I) // sample a diffusion step and noise. 7: e_t^+ = √ᾱ_t e_0^+ + √(1−ᾱ_t) ε^+ // add noise to the positive item embedding. 8: e_t^− = √ᾱ_t (1/V) Σ_{v=1}^{V} e_0^{v−} + √(1−ᾱ_t) ε^− // add noise to the centroid of the negative item embeddings. 9: θ ← θ − η ∇_θ L_PreferDiff(e_t^+, e_t^−, t, c, Φ; θ) // gradient descent update. 10: until convergence 11: return θ. Algorithm 2 (Inference Phase of PreferDiff). 1: Input: trained parameters θ, sequence encoder M(·), test dataset D_test = {(e_0, c)}_{n=1}^{|D_test|}, total steps T, DDIM steps S, guidance weight w, variance schedule {ᾱ_t}_{t=1}^T. 2: Output: predicted next item ê_0. 3: c ~ D_test // sample a user historical sequence from the test dataset. 4: ê_T ~ N(0, I) // sample standard Gaussian noise. 5: for s = S, …, 1 do // denoise over S DDIM steps. 6: t = s·(T/S) // map DDIM step s to original step t. 7: with probability p_u: M(c) = Φ // set the unconditional condition with probability p_u. 8: z ~ N(0, I) if s > 1 else z = 0 // sample noise if not the final step. 9: ê_0 = (1 + w)·F_θ(ê_t, M(c), t) − w·F_θ(ê_t, Φ, t) // apply classifier-free guidance. 10: ε̂_θ = (ê_t − √ᾱ_t ê_0) / √(1−ᾱ_t) // compute the predicted noise. 11: ê_{t−1} = √ᾱ_{t−1} ê_0 + √(1−ᾱ_{t−1}) ε̂_θ // DDIM update step when σ_t = 0. 12: end for 13: return ê_0 |
| Open Source Code | Yes | Our codes are available at https://github.com/lswhim/PreferDiff. Reproducibility Statement. All results in this work are fully reproducible. The hyperparameter search space is discussed in Table 11, and further details about the hardware and software environment are provided in Appendix D.2. We provide the code and the best hyperparameters for our method at https://github.com/lswhim/PreferDiff and in Table 12. |
| Open Datasets | Yes | We evaluate the proposed PreferDiff on six public real-world benchmarks (i.e., Sports, Beauty, and Toys from Amazon Reviews 2014 (He & McAuley, 2016), Steam, ML-1M, and Yahoo!R1). Detailed statistics of the benchmarks can be found in Table 5. Here, we utilize the common five-core datasets, filtering out users and items with fewer than five interactions. More details about data preprocessing can be found in Appendix D.1. Following prior work (Yang et al., 2023b), in Table 1 and Table 14, we employ a user-split that first sorts all sequences chronologically for each dataset, then splits the data into training, validation, and test sets with an 8:1:1 ratio, while preserving the last 10 interactions as the historical sequence. We reproduce all baselines for a fair comparison. Notably, in Table 8 and Table 9 of Appendix D.4, we also give a comparison under another setting (i.e., leave-one-out) to provide more insights, where the baseline results are copied from TIGER. Moreover, we conduct experiments on varied user history lengths in Appendix F.2. |
| Dataset Splits | Yes | Following prior work (Yang et al., 2023b), in Table 1 and Table 14, we employ a user-split that first sorts all sequences chronologically for each dataset, then splits the data into training, validation, and test sets with an 8:1:1 ratio, while preserving the last 10 interactions as the historical sequence. We reproduce all baselines for a fair comparison. Notably, in Table 8 and Table 9 of Appendix D.4, we also give a comparison under another setting (i.e., leave-one-out) to provide more insights, where the baseline results are copied from TIGER. Moreover, we conduct experiments on varied user history lengths in Appendix F.2. D.1 DATASET PREPROCESSING IN THE USER-SPLITTING SETTING Following prior works (Yang et al., 2023a;b), we adopt the user-splitting setting, which has been shown to effectively prevent information leakage in test sets (Ji et al., 2023). Specifically, we first sort all sequences chronologically for each dataset, then split the data into training, validation, and test sets with an 8:1:1 ratio, while preserving the last 10 interactions as the historical sequence. |
| Hardware Specification | Yes | For a fair comparison, all experiments are conducted in PyTorch using a single Tesla V100-SXM3 32GB GPU and an Intel(R) Xeon(R) Gold 6248R CPU. |
| Software Dependencies | No | For a fair comparison, all experiments are conducted in PyTorch using a single Tesla V100-SXM3 32GB GPU and an Intel(R) Xeon(R) Gold 6248R CPU. We optimize all methods using the AdamW optimizer, and all model parameters are initialized with standard normal initialization. We fix the embedding dimension to 64 for all models except DM-based recommenders, as the latter only demonstrate strong performance with higher embedding dimensions, as discussed in Section 4.3. Since our focus is not on network architecture, and for fair comparison, we adopt a lightweight configuration for baseline models that employ a Transformer backbone, using a single layer with two attention heads. Notably, all baselines, unless otherwise specified, use cross-entropy as the loss function, as recent studies (Zhang et al., 2024; Klenitskiy & Vasilev, 2023; Zhai et al., 2023) have demonstrated its effectiveness. For PreferDiff, for each user sequence, we treat the other next-items (a.k.a., labels) in the same batch as negative samples. We set the default diffusion timestep to 2000, the DDIM step to 20, p_u = 0.1, and β linearly increasing in the range [1e−4, 0.02] for all DM-based sequential recommenders (e.g., DreamRec). We empirically find that tuning these parameters may lead to better recommendation performance; however, as this is not the focus of the paper, we do not elaborate on it. The search space for the other hyperparameters (e.g., learning rate) of PreferDiff and the baseline models is provided in Table 11, while the best hyperparameters for PreferDiff are listed in Table 12. |
| Experiment Setup | Yes | We set the default diffusion timestep to 2000, the DDIM step to 20, p_u = 0.1, and β linearly increasing in the range [1e−4, 0.02] for all DM-based sequential recommenders (e.g., DreamRec). We empirically find that tuning these parameters may lead to better recommendation performance; however, as this is not the focus of the paper, we do not elaborate on it. The search space for the other hyperparameters (e.g., learning rate) of PreferDiff and the baseline models is provided in Table 11, while the best hyperparameters for PreferDiff are listed in Table 12. |
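The forward-noising step quoted in Algorithm 1 (lines 6 and 7), combined with the linear β schedule from the experiment setup (β in [1e−4, 0.02] over T = 2000 steps), can be sketched in NumPy. The function name and array shapes are illustrative, not taken from the paper's code:

```python
import numpy as np

T = 2000                              # default diffusion timesteps from the setup row
betas = np.linspace(1e-4, 0.02, T)    # beta increases linearly in [1e-4, 0.02]
alphas_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def q_sample(e0, t, rng):
    """Noise an item embedding e0 to diffusion step t (Algorithm 1, line 7)."""
    eps = rng.standard_normal(e0.shape)
    a = alphas_bar[t]
    return np.sqrt(a) * e0 + np.sqrt(1.0 - a) * eps, eps
```

For the negative side (Algorithm 1, line 8), the same call would be applied to the centroid of the in-batch negative embeddings rather than to a single item.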
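The negative-sampling scheme quoted in the dependencies row ("we treat the other next-items in the same batch as negative samples") reduces, for each user, to averaging the remaining target embeddings in the batch, i.e., the centroid that Algorithm 1 (line 8) then noises. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def negative_centroid(batch_targets, i):
    """Average all in-batch next-item embeddings except user i's own target."""
    mask = np.ones(len(batch_targets), dtype=bool)
    mask[i] = False                       # exclude the positive item itself
    return batch_targets[mask].mean(axis=0)
```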
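Algorithm 2's sampling loop (deterministic DDIM with classifier-free guidance, S = 20 strided steps over T = 2000) can be sketched as below. Here `F` stands in for the trained denoiser F_θ, `null_cond` plays the role of Φ, and the guidance weight value is illustrative; this is a sketch of the quoted pseudocode, not the authors' implementation:

```python
import numpy as np

T, S, w = 2000, 20, 2.0                 # total steps, DDIM steps, guidance weight (illustrative)
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

def ddim_sample(F, cond, null_cond, dim, rng):
    """DDIM sampling with classifier-free guidance (Algorithm 2, sigma_t = 0)."""
    e_t = rng.standard_normal(dim)                     # e_T ~ N(0, I)
    for s in range(S, 0, -1):
        t = s * (T // S) - 1                           # map DDIM step s to 0-indexed step t
        # blend conditional and unconditional predictions (line 9)
        e0_hat = (1 + w) * F(e_t, cond, t) - w * F(e_t, null_cond, t)
        # recover the predicted noise from e_t and e0_hat (line 10)
        eps_hat = (e_t - np.sqrt(abar[t]) * e0_hat) / np.sqrt(1.0 - abar[t])
        if s > 1:
            t_prev = (s - 1) * (T // S) - 1
            e_t = np.sqrt(abar[t_prev]) * e0_hat + np.sqrt(1.0 - abar[t_prev]) * eps_hat
        else:
            e_t = e0_hat                               # final deterministic step
    return e_t
```

The returned ê_0 would then be matched against item embeddings to rank candidates for the next-item prediction.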
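The user-split described in the dataset rows (sort user sequences chronologically, split 8:1:1, keep the last 10 interactions as history) could be implemented roughly as follows. The dictionary keys (`last_ts`, `items`) and the convention that the final interaction is the prediction target are assumptions for illustration:

```python
def user_split(sequences, hist_len=10):
    """Chronological user-split into 8:1:1 train/val/test (per Appendix D.1)."""
    seqs = sorted(sequences, key=lambda s: s["last_ts"])   # sort chronologically
    n = len(seqs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)

    def to_example(s):
        items = s["items"]
        # keep the last hist_len interactions before the target as the history
        return {"history": items[-(hist_len + 1):-1], "target": items[-1]}

    train = [to_example(s) for s in seqs[:n_train]]
    val = [to_example(s) for s in seqs[n_train:n_train + n_val]]
    test = [to_example(s) for s in seqs[n_train + n_val:]]
    return train, val, test
```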