Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

Authors: Yuyang Ye, Zhi Zheng, Yishan Shen, Tianshu Wang, Hengruo Zhang, Peijun Zhu, Runlong Yu, Kai Zhang, Hui Xiong

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences. [...] Experiments In this section, we detail the comprehensive experiment to validate the effectiveness of our proposed Multimodal Large Language Model for Sequential Multimodal Recommendation (MLLM-MSR).
Researcher Affiliation Collaboration 1 Department of Management Science and Information Systems, Rutgers University; 2 School of Data Science, University of Science and Technology of China; 3 Department of Applied Mathematics and Computational Science, University of Pennsylvania; 4 Bytedance Inc.; 5 School of Computer Science, Georgia Institute of Technology; 6 Department of Computer Science, University of Pittsburgh; 7 Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)
Pseudocode No The paper describes the method using figures (Figure 1, Figure 2, Figure 3) and prose, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/YuyangYe/MLLM-MSR
Open Datasets Yes Our experimental evaluation utilized three open-source, real-world datasets from diverse recommendation system domains. These datasets include the Microlens Dataset (Ni et al. 2023), featuring user-item interactions, video introductions, and video cover images; the Amazon-Baby Dataset; and the Amazon-Game Dataset (He and McAuley 2016; McAuley et al. 2015), all of which contain user-item interactions, product descriptions, and images.
Dataset Splits Yes Additionally, we implemented a 1:1 ratio for negative sampling during training and a 1:20 ratio for evaluation. Further details on these datasets are provided in Table 2. [...] all results were obtained using 5-fold cross-validation and various random seeds, and achieved a 95% confidence level.
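The 1:1 training and 1:20 evaluation negative-sampling ratios quoted above can be illustrated with a minimal stdlib-only sketch. The function name `sample_negatives` and the toy catalogue are assumptions for illustration, not code from the paper's repository:

```python
import random

def sample_negatives(positive_item, all_items, ratio, rng):
    """Draw `ratio` negative items per positive interaction,
    excluding the positive item itself from the candidate pool."""
    candidates = [i for i in all_items if i != positive_item]
    return rng.sample(candidates, ratio)

rng = random.Random(0)
items = list(range(100))   # toy item catalogue
pos = 7                    # one observed (positive) interaction

train_negs = sample_negatives(pos, items, 1, rng)   # 1:1 ratio for training
eval_negs = sample_negatives(pos, items, 20, rng)   # 1:20 ratio for evaluation

print(len(train_negs), len(eval_negs))  # 1 20
```

At evaluation time the model then ranks the positive item against its 20 sampled negatives, which is a common protocol for computing top-k metrics in sequential recommendation.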
Hardware Specification Yes Our experiments were performed on a Linux server equipped with eight A800 80GB GPUs.
Software Dependencies Yes We utilized Llava-v1.6-mistral-7b for image description and recommendation tasks, and Llama3-8b-instruct for summarizing user preferences. For the Supervised Fine Tuning (SFT) process, we employed the PyTorch Lightning library, using LoRA with a rank of 8. The optimization was handled by the AdamW optimizer with a learning rate of 2e-5 and a batch size of 1, setting gradient accumulation steps at 8 and epochs at 10. For distributed training, we implemented DeepSpeed [28] with ZeRO stage 2.
Experiment Setup Yes The optimization was handled by the AdamW optimizer with a learning rate of 2e-5 and a batch size of 1, setting gradient accumulation steps at 8 and epochs at 10.
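Because the reported per-device batch size is only 1, the effective batch size is determined by gradient accumulation and data parallelism. A minimal sketch, assuming standard data-parallel semantics (the paper reports per-device batch 1, 8 accumulation steps, and eight A800 GPUs; the formula itself is the usual one, not quoted from the paper):

```python
# Effective batch size under data-parallel training with gradient accumulation:
# each optimizer step sees per_device_batch * grad_accum_steps * num_gpus samples.
per_device_batch = 1   # batch size of 1 (reported)
grad_accum_steps = 8   # gradient accumulation steps (reported)
num_gpus = 8           # eight A800 80GB GPUs (reported hardware)

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 64
```

So although each forward pass processes a single example, every optimizer update aggregates gradients over 64 examples, which is what the AdamW learning rate of 2e-5 is effectively tuned against.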