MTSTRec: Multimodal Time-Aligned Shared Token Recommender

Authors: Ming-Yi Hong, Yen-Jung Hsu, Miao-Chen Chiang, Che Lin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MTSTRec achieves state-of-the-art performance across multiple sequential recommendation benchmarks, significantly improving upon existing multimodal fusion. Our code is available at https://github.com/idssplab/MTSTRec.
Researcher Affiliation | Academia | ¹Data Science Degree Program, National Taiwan University and Academia Sinica, Taiwan; ²Graduate Institute of Communication Engineering, National Taiwan University, Taiwan; ³Department of Electrical Engineering, National Taiwan University, Taiwan. Correspondence to: Che Lin <EMAIL>.
Pseudocode | No | The paper describes methods in text and uses mathematical formulas, but does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | Our code is available at https://github.com/idssplab/MTSTRec.
Open Datasets | Yes | Our experiments utilize three datasets: two proprietary datasets from AviviD Innovative Multimedia (Food E-commerce and Household E-commerce), which have already been made publicly available¹, and one public dataset from H&M. ... ¹The datasets are available at https://github.com/idssplab/MTSTRec. For more details regarding the dataset release, please refer to Appendix O.
Dataset Splits | Yes | The data is split chronologically into 75% for training, 12.5% for validation, and 12.5% for testing based on the purchase orders. For the H&M (Trousers) dataset, which contains only purchase actions, items are sorted by purchase time, and those bought on the last day are used as the answer set, ensuring consistency across all datasets (Meng et al., 2020).
Hardware Specification | No | The paper mentions that computational resources were provided by the National Center for High-performance Computing (NCHC), but does not specify the exact hardware models (e.g., specific GPUs, CPUs, or memory).
Software Dependencies | No | The paper mentions the use of various models and APIs (Llama 3.1, GPT-4o-mini, VGG-19, BERT), but does not provide specific version numbers for general software dependencies or libraries (e.g., Python version, PyTorch/TensorFlow version).
Experiment Setup | Yes | In our experiments, we tuned the hyperparameters based on validation data to ensure optimal performance. The batch size was uniformly set to 64 for all models, and the input dimension d was fixed at 512. We employed the AdamW optimizer, while the maximum sequence length N was set to 20. The fusion layers were standardized across models, with Lfusion = 3 and a dropout rate of 0.1. ... The number of layers in each encoder (Lmod) was tested across values of {2, 4, 8}, and the number of attention heads across {1, 2, 4, 8, 16}. We also experimented with dropout rates of {0.1, 0.2, 0.3} in the hidden layers. The learning rate was tested across a range of {0.001, 0.0005, 0.0001, 0.00005, 0.00001}, while the L2 regularization penalty was tuned from {0.0001, 0.00005, 0.00001, 0.000005, 0.000001}. A gamma value of {0.9, 0.75, 0.5} was set for learning rate decay.
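The chronological 75% / 12.5% / 12.5% split described under Dataset Splits can be sketched as follows. This is a minimal illustration of splitting interactions by purchase time; the function name and interaction tuple format are assumptions for the example, not taken from the MTSTRec codebase.

```python
def chronological_split(interactions, train_frac=0.75, val_frac=0.125):
    """Split (user, item, timestamp) events chronologically into
    train / validation / test partitions by purchase order."""
    events = sorted(interactions, key=lambda e: e[2])  # sort by timestamp
    n = len(events)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return events[:train_end], events[train_end:val_end], events[val_end:]

# Toy example: 8 purchase events -> 6 train, 1 validation, 1 test
purchases = [("u1", "i1", 1), ("u1", "i2", 2), ("u2", "i3", 3), ("u2", "i4", 4),
             ("u1", "i5", 5), ("u2", "i6", 6), ("u1", "i7", 7), ("u2", "i8", 8)]
train, val, test = chronological_split(purchases)
```

Because the split is by time rather than by random sampling, the test partition always contains the most recent purchases, matching the paper's evaluation protocol.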
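The Experiment Setup row enumerates a full grid of tuned hyperparameters alongside fixed ones. A minimal sketch of that search space, with illustrative key names (the actual MTSTRec training configuration may use different names):

```python
from itertools import product

# Fixed settings quoted in the paper's Experiment Setup description.
fixed = {"batch_size": 64, "d_model": 512, "max_seq_len": 20,
         "fusion_layers": 3, "fusion_dropout": 0.1, "optimizer": "AdamW"}

# Tuned hyperparameters and their reported candidate values.
grid = {
    "encoder_layers": [2, 4, 8],
    "attention_heads": [1, 2, 4, 8, 16],
    "hidden_dropout": [0.1, 0.2, 0.3],
    "learning_rate": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    "l2_penalty": [1e-4, 5e-5, 1e-5, 5e-6, 1e-6],
    "lr_decay_gamma": [0.9, 0.75, 0.5],
}

# Enumerate every combination of tuned values, merged with the fixed settings.
configs = [dict(zip(grid, vals), **fixed) for vals in product(*grid.values())]
print(len(configs))  # 3 * 5 * 3 * 5 * 5 * 3 = 3375 candidate configurations
```

Enumerating the grid makes the cost of the reported search explicit: an exhaustive sweep covers 3,375 configurations per dataset, which is why validation-based selection (as the paper states) is the practical approach.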