Coherency Improved Explainable Recommendation via Large Language Model

Authors: Shijie Liu, Ruixin Ding, Weihai Lu, Jun Wang, Mo Yu, Xiaoming Shi, Wei Zhang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experimental results on three datasets of explainable recommendation show that the proposed framework is effective, outperforming state-of-the-art baselines with improvements of 7.3% in explainability and 4.4% in text quality. ... We conduct extensive experiments to demonstrate the effectiveness of the proposed framework against strong baselines, and experimental results show that training techniques can further improve the results. ... Experimental Setting Dataset To validate the effectiveness of our method, we conducted experiments on three publicly available datasets and their splits (Li, Zhang, and Chen 2020).
Researcher Affiliation Collaboration Shijie Liu1*, Ruixin Ding1*, Weihai Lu2*, Jun Wang1, Mo Yu3, Xiaoming Shi1, Wei Zhang1 1East China Normal University, 2Peking University, 3WeChat AI, Tencent
Pseudocode No The paper describes the methodology using textual explanations, mathematical formulas, and diagrams (Figure 2, Figure 3), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes Code https://github.com/karrich/CIER
Open Datasets Yes Dataset To validate the effectiveness of our method, we conducted experiments on three publicly available datasets and their splits (Li, Zhang, and Chen 2020). ... The three datasets are from TripAdvisor (hotel), Amazon (movies & TV), and Yelp (restaurant). ... The available datasets and keyword extraction tools are provided by Sentires (Zhang et al. 2014; Li et al. 2020).
Dataset Splits Yes Each dataset is randomly divided into training, validation, and test sets in an 8:1:1 ratio five times.
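The split protocol quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: each dataset is shuffled and divided 8:1:1 into train/validation/test, and the procedure is repeated five times with different seeds so that reported metrics can be averaged over the five splits.

```python
# Hedged sketch of the 8:1:1 split repeated five times (assumed protocol,
# not taken from the CIER repository).
import random

def split_811(records, seed):
    """Shuffle `records` with the given seed and split 8:1:1."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Five independent splits of a toy dataset of 1000 example ids.
splits = [split_811(range(1000), seed) for seed in range(5)]
```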
Hardware Specification Yes All the experiments are conducted on an NVIDIA H800 GPU.
Software Dependencies No The paper mentions models like LLaMA2-7B, GPT-4, gpt-4o, and bert-base-multilingual-uncased-sentiment, and optimizers like AdamW. However, it does not provide specific version numbers for general ancillary software libraries or programming languages (e.g., Python, PyTorch/TensorFlow versions) that are typically required for reproduction.
Experiment Setup Yes For CIER, λ is set to 0.1 and γ to 0.2, selected through grid search over the ranges [0.01, 0.1, 1.0, 10.0] and [0.0, 0.2, 0.5, 0.8, 1.0], respectively. The model is optimized using the AdamW (Loshchilov and Hutter 2017) optimizer with hierarchical learning rates: 10^-4 for the LoRA module and 10^-3 for the other components. The number of training epochs is set to 3 and the embedding size d is set to 1024.
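The hyperparameter selection described above can be sketched as a plain grid search. This is a minimal illustration of the reported search (λ over [0.01, 0.1, 1.0, 10.0], γ over [0.0, 0.2, 0.5, 0.8, 1.0]); the `validate` callable is a hypothetical stand-in for training the model and scoring it on the validation set, not part of the paper.

```python
# Hedged sketch of the grid search over lambda and gamma; `validate` is
# a hypothetical placeholder for train-then-evaluate on validation data.
from itertools import product

LAMBDAS = [0.01, 0.1, 1.0, 10.0]
GAMMAS = [0.0, 0.2, 0.5, 0.8, 1.0]

def grid_search(validate):
    """Return the (lambda, gamma) pair with the highest validation score."""
    best_pair, best_score = None, float("-inf")
    for lam, gamma in product(LAMBDAS, GAMMAS):
        score = validate(lam, gamma)
        if score > best_score:
            best_pair, best_score = (lam, gamma), score
    return best_pair, best_score

# Toy scorer that peaks at the paper's reported setting (0.1, 0.2).
best, _ = grid_search(lambda lam, g: -abs(lam - 0.1) - abs(g - 0.2))
```

With the toy scorer, the search recovers the pair the paper reports, λ = 0.1 and γ = 0.2.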