Revisiting XRec: How Collaborative Signals Influence LLM-Based Recommendation Explanations
Authors: Cătălin-Emanuel Brița, Hieu Nguyen, Lubov Chalakova, Nikola Petrov
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we reproduce and expand upon the findings of Ma et al. (2024). While our results validate most of the original authors' claims, we were unable to fully replicate the reported performance improvements from injecting collaborative information into every LLM attention layer, nor the claimed effects of data sparsity. Beyond replication, our contributions provide evidence that the Graph Neural Network (GNN) component does not enhance explainability. Instead, the observed performance improvement is attributed to the Collaborative Information Adapter, which can act as a form of soft prompting, efficiently encoding task-specific information. This finding aligns with prior research suggesting that lightweight adaptation mechanisms can condition frozen LLMs for specific downstream tasks. Our implementation is open-source. After performing the above steps, we verified that XRec (Ma et al., 2024) outperforms the baseline models. Our findings also confirm that incorporating user and item profile information improves performance and that both the profile and the injection component can contribute to improved personalization. However, we encountered challenges in reproducing the claims that injecting collaborative information across all LLM attention layers improves performance and that XRec is more effective under increased data sparsity. |
| Researcher Affiliation | Academia | Catalin E. Brita* EMAIL University of Amsterdam Hieu Nguyen* EMAIL University of Amsterdam Lubov Chalakova* EMAIL University of Amsterdam Nikola Petrov* EMAIL University of Amsterdam |
| Pseudocode | No | The paper describes the XRec framework through mathematical equations and a diagram (Figure 1), but it does not contain a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Our implementation is open-source: https://github.com/tr2512/ReXRec |
| Open Datasets | Yes | To build XRec, Ma et al. (2024) use three datasets, each containing user reviews for various items: Amazon Review Data (Ni et al., 2019), https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/ ; Google Local Data (Li et al., 2022; Yan et al., 2023), https://jiachengli1995.github.io/google/index.html ; and the Yelp Open Dataset, https://business.yelp.com/data/resources/open-dataset/ |
| Dataset Splits | Yes | We provide the Train, Validation, and Test splits for the reproduced interaction datasets [Re]Amazon, [Re]Google, and [Re]Yelp in Table 9. These interaction datasets are used to train the Recommender System. From each interaction dataset, we extract a subset containing interactions with ground-truth explanations, which we refer to as the explanation dataset. This explanation dataset is used to train the XRec framework. To ensure consistency, the splits for the explanation dataset were designed to match the numbers of the corresponding original datasets (Ma et al., 2024), and the same split ratios were applied to the interaction datasets. Table 9 (Explanations / Interactions): [Re]Amazon — Train: 95,841 / 303,246; Validation: 11,980 / 37,905; Test: 3,000 / 9,493. [Re]Google — Train: 94,663 / 345,846; Validation: 11,833 / 43,231; Test: 3,000 / 10,961. [Re]Yelp — Train: 74,212 / 337,797; Validation: 9,277 / 42,227; Test: 3,000 / 13,656. |
| Hardware Specification | Yes | Training and inference are conducted using a single NVIDIA A100 40GB GPU, while evaluation is performed on two NVIDIA T4 16GB GPUs. |
| Software Dependencies | No | The paper mentions several LLM models used (LLaMA 2 7B, GPT-3.5 Turbo, LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, Gemma 2 9B-IT, Qwen 2.5 7B-Instruct) and an optimizer (Adam), but it does not specify software dependencies like programming language versions or library versions (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | The training process followed the hyperparameter configurations specified in Ma et al. (2024). An early stopping criterion is applied to the GNN-based models based on the Recall@20 metric. The LLM used is LLaMA 2 7B (Touvron et al., 2023), which remains frozen throughout the training process. Appendix A shows the exact hyperparameter values for each phase. Table 8: Hyperparameter configurations for the GNN-based recommender system and the Collaborative Information Adapter module (MoE), both optimized using Adam (Kingma & Ba, 2015). GNN — batch size: 1024; epochs: 300; learning rate: 0.001; number of layers: 4; embedding size: 64; early stopping patience: 10 epochs. MoE — batch size: 1; epochs: 1; learning rate: 0.001; number of experts: 8; dropout rate: 0.2; gating router noise factor: 0.01. |
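The review's central claim — that the Collaborative Information Adapter, rather than the GNN, drives performance by acting as a soft prompt — can be illustrated with a minimal sketch. The code below is a hypothetical Mixture-of-Experts adapter that projects a GNN-sized collaborative embedding (64-dim) into the LLM token-embedding space; the expert count, dropout rate, and noisy-gating factor follow the hyperparameters quoted above, but the class name, dimensions, and exact routing scheme are assumptions, not XRec's actual implementation.

```python
import torch
import torch.nn as nn

class MoEAdapterSketch(nn.Module):
    """Hypothetical MoE collaborative adapter (not the XRec source).

    Maps a 64-dim GNN embedding to the LLM hidden size so the result
    can be prepended to the prompt as a soft-prompt token. Uses the
    reported settings: 8 experts, dropout 0.2, gating noise 0.01.
    """

    def __init__(self, gnn_dim=64, llm_dim=4096, num_experts=8,
                 dropout=0.2, noise_factor=0.01):
        super().__init__()
        # One linear "expert" per mixture component.
        self.experts = nn.ModuleList(
            nn.Linear(gnn_dim, llm_dim) for _ in range(num_experts)
        )
        self.gate = nn.Linear(gnn_dim, num_experts)
        self.dropout = nn.Dropout(dropout)
        self.noise_factor = noise_factor

    def forward(self, gnn_emb):
        # Noisy gating: perturb router logits during training only.
        logits = self.gate(gnn_emb)
        if self.training:
            logits = logits + torch.randn_like(logits) * self.noise_factor
        weights = torch.softmax(logits, dim=-1)                 # (B, E)
        # Run every expert and mix by the gating weights.
        outs = torch.stack([e(gnn_emb) for e in self.experts], dim=1)  # (B, E, D)
        mixed = (weights.unsqueeze(-1) * outs).sum(dim=1)       # (B, D)
        return self.dropout(mixed)

adapter = MoEAdapterSketch()
adapter.eval()  # disable gating noise and dropout for a deterministic pass
soft_prompt = adapter(torch.randn(4, 64))  # batch of 4 collaborative embeddings
print(soft_prompt.shape)  # one 4096-dim soft-prompt vector per input
```

Because the LLM stays frozen, only the adapter's parameters are trained, which is why the review likens it to soft prompting: the adapter learns task-specific vectors that condition the frozen model.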