Preference Discerning with LLM-Enhanced Generative Retrieval

Authors: Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky T. Q. Chen, Zhang Gabriel Li, Xiaoli Gao, Wei Shao, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Hamid Eghbalzadeh

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate preference discerning, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. Upon evaluating current state-of-the-art methods on our benchmark, we discover that their ability to dynamically adapt to evolving user preferences is limited. To address this, we propose a new method named Mender (Multimodal Preference Discerner), which achieves state-of-the-art performance in our benchmark. Our results show that Mender effectively adapts its recommendations guided by human preferences, even if not observed during training, paving the way toward more flexible recommendation models. ... We evaluate state-of-the-art generative retrieval methods on our benchmark and find that they lack several key abilities of preference discerning. Therefore, we introduce a novel multimodal generative retrieval method named Multimodal preference discerner (Mender) that effectively fuses pre-trained language encoders with the generative retrieval framework (Rajput et al., 2023) for preference discerning. ... 4 Experiments We evaluate our approach on four widely used datasets, namely three Amazon reviews subsets (Ni et al., 2019) and Steam (Kang & McAuley, 2018). ... We present a detailed analysis of the results obtained by the different methods on our benchmark for three subsets of Amazon reviews (Beauty, Sports and Outdoors, and Toys and Games) and the Steam dataset. Fig. 4 and Fig. 5a show Recall@10 for all methods on the Amazon and Steam datasets, respectively. Table 1 also shows Recall@10 plus additional metrics, such as Recall@5, NDCG@5, and NDCG@10, as well as relative improvements of Mender over the best baseline method.
Researcher Affiliation Collaboration Fabian Paischer (ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; AI at Meta); Liu Yang (University of Wisconsin-Madison; AI at Meta); Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky Chen, Zhang Gabriel Li, Xiaoli Gao, Wei Shao, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Hamid Eghbalzadeh (AI at Meta)
Pseudocode Yes Algorithm 1: Preference Approximation
Input: prompt x, users U, items I, reviews R, language model LLM(·), user sequence length T_u
1: for u ∈ U do
2:   for t ∈ {1, ..., T_u} do
3:     P_u^(t) ← LLM([x; i_u^(1); r_u^(1); ...; i_u^(t); r_u^(t)])
4:   end for
5: end for
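Algorithm 1 can be sketched in Python as below. This is a minimal illustration only: the `llm` callable stands in for the LLaMA-3-70B-Instruct model the paper uses, and the dictionary-of-histories data layout is an assumption, not the paper's actual schema.

```python
def approximate_preferences(llm, prompt, users):
    """For each user u and each prefix length t, build the context
    [x; i_u^(1); r_u^(1); ...; i_u^(t); r_u^(t)] and query the LLM
    for a natural-language preference P_u^(t) (Algorithm 1)."""
    preferences = {}
    for user, history in users.items():  # history: list of (item, review)
        preferences[user] = []
        for t in range(1, len(history) + 1):
            # Concatenate the prompt with the first t items and reviews.
            context = [prompt]
            for item, review in history[:t]:
                context += [item, review]
            preferences[user].append(llm("\n".join(context)))
    return preferences

# Usage with a stub LLM that just reports how much context it saw.
stub = lambda text: f"preference from {len(text.splitlines())} lines"
prefs = approximate_preferences(stub, "Summarize:", {"u1": [("i1", "r1"), ("i2", "r2")]})
```

Note that the inner loop re-queries the LLM for every prefix length, so each user contributes T_u preference descriptions, one per interaction step.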
Open Source Code Yes Code is available at https://github.com/facebookresearch/preference_discerning.
Open Datasets Yes We evaluate our approach on four widely used datasets, namely three Amazon reviews subsets (Ni et al., 2019) and Steam (Kang & McAuley, 2018).
Dataset Splits Yes We adopt a leave-last-out data split, where the penultimate item of a sequence is used for validation and the last item is used for testing (Kang & Mc Auley, 2018; Sun et al., 2019). Our evaluation benchmark is based only on the validation and test items of that split along with their paired user preferences, that is, we use preferences for training and inference. The remaining items in each sequence are used for training, except for the first item, since no user preferences are available for it.
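The leave-last-out split described above can be sketched as follows. This is a hedged illustration, assuming each user's history is an ordered list of item IDs; the function name is illustrative.

```python
def leave_last_out(sequence):
    """Split one user's interaction sequence: the last item is held out
    for testing, the penultimate for validation, and the remainder is
    used for training, except the first item, for which no preceding
    user preference is available."""
    assert len(sequence) >= 3, "need at least one train, val, and test item"
    train = sequence[1:-2]   # drop the first item (no preference available)
    val = sequence[-2]       # penultimate item -> validation
    test = sequence[-1]      # last item -> test
    return train, val, test

train, val, test = leave_last_out(["i1", "i2", "i3", "i4", "i5"])
```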
Hardware Specification Yes All our methods are trained on single A100 or V100 GPUs using PyTorch (Paszke et al., 2019).
Software Dependencies Yes All our methods are trained on single A100 or V100 GPUs using PyTorch (Paszke et al., 2019). ... To generate user preferences, we utilize the LLaMA-3-70B-Instruct model. ... For the sentiment classification, we employ the model trained by Hartmann et al. (2023).
Experiment Setup Yes For training our models, we use the preference-based recommendation data, which consists of a single user preference and the interaction history. Unless mentioned otherwise, the additional generated data splits (positive/negative and fine/coarse data) are used solely for evaluation purposes. Following (Rajput et al., 2023), we limit the maximum number of items in a user sequence to the 20 most recent ones. For the Beauty, Toys and Games, and Steam datasets, we found it beneficial to also fine-tune the language encoder, for which we use LoRA (Hu et al., 2022). By default, we use the FLAN-T5-Small (Chung et al., 2024) language encoder for MenderTok.
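The history-truncation step above (keeping the 20 most recent items, following Rajput et al., 2023) can be sketched as below; the constant and helper name are illustrative, not from the paper's code.

```python
MAX_SEQ_ITEMS = 20  # cap from Rajput et al. (2023), as stated in the setup

def truncate_history(items):
    """Keep only the MAX_SEQ_ITEMS most recent interactions, preserving
    chronological order (older items are dropped from the front)."""
    return items[-MAX_SEQ_ITEMS:]

# Usage: a 50-item history is reduced to its last 20 items.
recent = truncate_history(list(range(50)))
```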