Enhanced Recommendation Systems with Retrieval-Augmented Large Language Model

Authors: Chuyuan Wei, Ke Duan, Shengda Zhuo, Hongchun Wang, Shuqiang Huang, Jie Liu

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental validation on two real-world datasets demonstrates the efficacy of our approach, significantly enhancing both the accuracy and robustness of recommendations compared to state-of-the-art methods."
Researcher Affiliation | Academia | Chuyuan Wei, College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Ke Duan, College of Mechanical-Electronic and Vehicle Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Shengda Zhuo (Corresponding Author), College of Cyber Security, Jinan University, Guangzhou, Guangdong, China; Hongchun Wang, College of Urban Economics and Management, Beijing University of Civil Engineering and Architecture, Beijing, China; Shuqiang Huang (Corresponding Author), College of Cyber Security, Jinan University, Guangzhou, Guangdong, China; Jie Liu, North China University of Technology, Beijing, China
Pseudocode | Yes | "Algorithm 1: Training Procedure of ER2ALM. Input: item set I, user set U, historical interaction set H_U. Output: top-k recommendations."
Open Source Code | No | "This study employs the locally deployed ChatGLM3-6B³ model to enhance data through LLM-generated dialogs. The AdamW optimizer (Paszke et al., 2019) was employed for training, with learning rates ranging over [5×10⁻⁵, 1×10⁻³] for the Netflix dataset and [2.5×10⁻⁴, 9.5×10⁻⁴] for the MovieLens dataset. For the LLM parameters, the temperature was selected from {0.4, 0.8, 1} to balance the accuracy and richness of the generated content. The top-p value, used to control generation precision, was chosen from {0.6, 0.8, 1}. To maintain response integrity, data flow was disabled. For embedding generation, we utilized a 1024-dimensional RoBERTa model to capture more detailed content. For noise reduction, the threshold was set to 0.4, with similarity judgments applied once the number of trusted embeddings reached 500."
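The top-p values quoted above control nucleus sampling during dialog generation: only the smallest set of highest-probability tokens whose cumulative mass reaches p is kept, and the rest are discarded before sampling. A minimal stand-alone sketch of that filtering step (the function name and toy distribution are illustrative, not from the paper's code):

```python
def top_p_filter(probs, p):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize the kept mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Example: with top-p = 0.8 only the two most likely tokens survive.
filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], 0.8)
```

A lower p (e.g. 0.6 from the paper's grid) prunes more aggressively and yields more conservative generations; p = 1 disables the filter entirely.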
Open Datasets | Yes | "We conduct experiments using publicly available datasets, Netflix and MovieLens 10M (ML-10M), both of which contain basic information about the movies. Netflix¹, released by Netflix, contains over 100 million anonymous movie ratings collected from users. MovieLens is a widely used series of benchmark datasets in recommendation-system tasks. ML-10M² contains 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users." (1) https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data (2) https://grouplens.org/datasets/movielens/10m/
Dataset Splits | No | "To mitigate potential biases in the test sampling process, we adopt the all-ranking evaluation strategy (Wei et al., 2021a, 2020)."
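Under the all-ranking strategy quoted above, each held-out item is ranked against the full item catalog rather than a sampled set of negatives, which removes the bias introduced by negative sampling. A minimal sketch of the two metrics most commonly reported under this protocol, HR@k and NDCG@k (the exact metric set used by the paper is not stated in this excerpt):

```python
import math

def hit_ratio_at_k(scores, target, k):
    """All-ranking HR@k: score *every* item, then check whether the
    held-out target item lands in the top-k of the full ranking."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(scores, target, k):
    """All-ranking NDCG@k with a single relevant item: discounted gain
    1 / log2(rank + 2) if the target is in the top-k, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if target in ranked[:k]:
        return 1.0 / math.log2(ranked.index(target) + 2)
    return 0.0

# Toy catalog of 4 items; item 2 is the held-out positive.
hr = hit_ratio_at_k([0.1, 0.9, 0.5, 0.3], target=2, k=2)    # → 1.0
```

Averaging these per-user values over all test users gives the dataset-level metrics.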
Hardware Specification | No | "This study employs the locally deployed ChatGLM3-6B model to enhance data through LLM-generated dialogs."
Software Dependencies | Yes | "This study employs the locally deployed ChatGLM3-6B³ model to enhance data through LLM-generated dialogs." (3) https://huggingface.co/THUDM/chatglm3-6b
Experiment Setup | Yes | "In this part, we provide a concise overview of the general experimental setup, including details on the datasets, evaluation protocols, comparative baselines, and implementation specifics. Implementation Details: This study employs the locally deployed ChatGLM3-6B model to enhance data through LLM-generated dialogs. The AdamW optimizer (Paszke et al., 2019) was employed for training, with learning rates ranging over [5×10⁻⁵, 1×10⁻³] for the Netflix dataset and [2.5×10⁻⁴, 9.5×10⁻⁴] for the MovieLens dataset. For the LLM parameters, the temperature was selected from {0.4, 0.8, 1} to balance the accuracy and richness of the generated content. The top-p value, used to control generation precision, was chosen from {0.6, 0.8, 1}. To maintain response integrity, data flow was disabled. For embedding generation, we utilized a 1024-dimensional RoBERTa model to capture more detailed content. For noise reduction, the threshold was set to 0.4, with similarity judgments applied once the number of trusted embeddings reached 500."
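The noise-reduction settings quoted above (threshold 0.4, activation at 500 trusted embeddings) suggest a similarity gate over a growing pool of trusted embeddings. The excerpt does not spell out the exact rule, so the sketch below is one plausible reading, assuming cosine similarity and a max-similarity criterion; the function names and pooling rule are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_noisy(embedding, trusted, threshold=0.4, min_pool=500):
    """Flag an embedding as noise when its best similarity to the trusted
    pool falls below the threshold. Judgments are deferred until the pool
    reaches min_pool entries (500 and 0.4 follow the reported settings;
    the max-similarity rule itself is an assumption)."""
    if len(trusted) < min_pool:
        return False  # too few trusted embeddings to judge reliably
    return max(cosine(embedding, t) for t in trusted) < threshold
```

Deferring judgments until the pool is large enough avoids rejecting valid embeddings early on, when the trusted set is still unrepresentative.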