Test-Time Adaptation on Recommender System with Data-Centric Graph Transformation
Authors: Yating Liu, Xin Zheng, Yi Li, Yanqing Guo
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate TTA-GREC's superiority at test time and provide new data-centric insights on test-time adaptation for better recommender system inference. Extensive experiments on multiple public datasets demonstrate that TTA-GREC significantly outperforms existing methods on key metrics such as Recall and NDCG (e.g., 4.46% Recall and 1.86% NDCG improvement on Last-FM). To evaluate the effectiveness of the proposed TTA-GREC, we compare its performance with baseline methods across multiple datasets, as shown in Table 1. To evaluate the contribution of each submodule in TTA-GREC, we conduct ablation studies by sequentially removing: (I) w/o UI transformation: removes the UI graph transformation, using only the original test UI list. (II) w/o KG revision: removes the KG transformation, using only the original KG embedding. (III) w/o CL: removes the sampling-based contrastive learning, using instead the Euclidean distance between embeddings. |
| Researcher Affiliation | Academia | Yating Liu¹, Xin Zheng², Yi Li¹, Yanqing Guo¹ (¹Dalian University of Technology, China; ²Griffith University, Australia) |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, a link to a code repository, or mention of code being available in supplementary materials. The conclusion mentions future work on 'more efficient TTA strategies' which suggests no immediate public release of code. |
| Open Datasets | Yes | We utilize three different datasets: Last-FM, MIND, and Alibaba-iFashion, which respectively represent different domains of recommender systems. Last-FM [Wang et al., 2019; Zhao et al., 2019]: a dataset of user-music interaction logs with rich metadata. MIND [Tian et al., 2021]: a news recommendation dataset with complex user-item interactions and semantic content. Alibaba-iFashion [Wang et al., 2021]: a dataset focused on fashion product recommendations, featuring dynamic user preferences and detailed item attributes. We follow the procedures and partitions in previous works [Wang et al., 2019; Tian et al., 2021; Wang et al., 2021; Yang et al., 2023]. |
| Dataset Splits | Yes | We follow the procedures and partitions in previous works [Wang et al., 2019; Tian et al., 2021; Wang et al., 2021; Yang et al., 2023]. For each KGNN model, we follow a standard training pipeline and train it on the training set until it achieves the best performance on the validation set in terms of recommendation. |
| Hardware Specification | Yes | Table 3: Runtime efficiency comparison (Evaluated on NVIDIA RTX 4090 GPU). |
| Software Dependencies | No | The paper mentions models and components like 'GCN', 'KGNNθ model', 'MLP', but does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers, nor the programming language version used for implementation. |
| Experiment Setup | Yes | We evaluate performance using Recall@N and NDCG@N, with N = 20, to assess the model's capability in generating top-N recommendations effectively. Hyper-parameter Sensitivity Analysis. The results in Figures 3 and 4 highlight the impact of mask size and temperature parameter on Recall and NDCG. Figure 3 shows the effect of different mask sizes on Recall and NDCG. The main observations are as follows: the best performance is achieved with a mask size of 128. Figure 4 shows the effect of different values of τ on Recall and NDCG. We observe that both Recall and NDCG reach their highest values when τ = 0.1. |
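For reference, the Recall@20 and NDCG@20 metrics quoted above can be computed as below. This is a minimal sketch with standard binary-relevance definitions; the function names (`recall_at_n`, `ndcg_at_n`) are illustrative and not from the paper, which does not publish its evaluation code.

```python
import math

def recall_at_n(ranked_items, relevant_items, n=20):
    """Fraction of the user's relevant items that appear in the top-n ranking."""
    if not relevant_items:
        return 0.0
    hits = len(set(ranked_items[:n]) & set(relevant_items))
    return hits / len(relevant_items)

def ndcg_at_n(ranked_items, relevant_items, n=20):
    """NDCG@n with binary relevance: DCG of the ranking over the ideal DCG."""
    relevant = set(relevant_items)
    # Gain 1/log2(rank+1) for each relevant item, with ranks starting at 1.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:n]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), n)))
    return dcg / ideal if ideal > 0 else 0.0
```

A perfect top-20 ranking of all relevant items yields NDCG@20 = 1.0, so the reported gains are relative improvements over baselines on these normalized metrics.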
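The paper's sampling-based contrastive loss is not spelled out in the excerpts above, but the role of the temperature τ (best value τ = 0.1 per Figure 4) can be illustrated with a generic InfoNCE-style formulation. This is a sketch under that assumption, operating on precomputed similarities; the function name `info_nce` is not from the paper.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE-style loss from similarity scores.

    sim_pos: similarity between the anchor and its positive sample.
    sim_negs: list of similarities between the anchor and negative samples.
    A smaller tau sharpens the softmax, penalizing hard negatives more.
    Returns -log softmax of the positive logit (lower is better).
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]
```

With the positive already ranked above the negatives, lowering τ from 1.0 to 0.1 drives the loss toward zero, which matches the intuition that a small temperature concentrates probability mass on the best-matching sample.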