Multimodal Quantitative Language for Generative Recommendation

Authors: Jianyang Zhai, Zi-Feng Mai, Chang-Dong Wang, Feidiao Yang, Xiawu Zheng, Hui Li, Yonghong Tian

ICLR 2025

Reproducibility assessment (variable, extracted result, and supporting LLM response):
Research Type: Experimental. "We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline by 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively." "We conduct extensive experiments and analyses on three public datasets, and the results validate the effectiveness of our proposed method."
Researcher Affiliation: Academia. Sun Yat-sen University; Pengcheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; Xiamen University; Peking University.
Pseudocode: No. The paper describes the methodology in prose and mathematical equations in Section 3, titled "MQL4GRec", but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "Our implementation is available at: https://github.com/zhaijianyang/MQL4GRec."
Open Datasets: Yes. "We evaluate the proposed approach on three public real-world benchmarks from the Amazon Product Reviews dataset (Ni et al., 2019), containing user reviews and item metadata from May 1996 to October 2018."
Dataset Splits: Yes. "Following previous work (Rajput et al., 2023), we first filter out unpopular users and items with less than five interactions. Then, we create user behavior sequences based on the chronological order. The maximum item sequence length is uniformly set to 20 to meet all baseline requirements. We employ the leave-one-out strategy for evaluation. We perform full ranking evaluation over the entire item set instead of sample-based evaluation."
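The quoted preprocessing (interaction-count filtering, chronological ordering, length capping, leave-one-out evaluation) can be sketched as below. This is an illustrative helper, not the paper's code: `build_splits` is a hypothetical name, and the single-pass count filter is a simplification of the iterative k-core filtering that pipelines in this line of work often apply.

```python
from collections import defaultdict

def build_splits(interactions, min_count=5, max_len=20):
    """Leave-one-out splits from (user, item, timestamp) tuples.

    Single-pass filtering of users/items with fewer than `min_count`
    interactions, chronological sequence construction, and truncation
    to the most recent `max_len` items per user.
    """
    # Count interactions per user and per item.
    user_cnt, item_cnt = defaultdict(int), defaultdict(int)
    for u, i, _ in interactions:
        user_cnt[u] += 1
        item_cnt[i] += 1
    kept = [(u, i, t) for u, i, t in interactions
            if user_cnt[u] >= min_count and item_cnt[i] >= min_count]

    # Chronological behavior sequence per user.
    seqs = defaultdict(list)
    for u, i, t in sorted(kept, key=lambda x: x[2]):
        seqs[u].append(i)

    splits = {}
    for u, items in seqs.items():
        if len(items) < 3:          # need at least train/valid/test
            continue
        items = items[-max_len:]    # cap sequence length
        # Last item -> test, second-to-last -> validation, rest -> train.
        splits[u] = {"train": items[:-2], "valid": items[-2], "test": items[-1]}
    return splits
```

Full ranking over the entire item set (as opposed to sampled negatives) would then score every item against the held-out target for each user.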
Hardware Specification: Yes. "Our experiments utilize the Tesla V100 GPU. For pretraining, we use four cards, and for fine-tuning, we use two cards."
Software Dependencies: No. The paper mentions using LLaMA, CLIP's image branch (ViT-L/14 as backbone), the T5 framework, and the AdamW optimizer, but it does not specify version numbers for the underlying software frameworks or libraries (e.g., Python, PyTorch/TensorFlow, HuggingFace Transformers).
Experiment Setup: Yes. "The level of codebooks is set to 4, with each level consisting of 256 codebook vectors, and each vector has a dimension of 32. The model is optimized using the AdamW optimizer, employing a learning rate of 0.001 and a batch size of 1024. We use 4 layers each for the transformer-based encoder and decoder models, with 6 self-attention heads of dimension 64 in each layer. The MLP and input dimensions were set to 1024 and 128, respectively. The number of prompt tokens for every task is set to 4. We employ the AdamW (Loshchilov & Hutter, 2019) optimizer for model optimization, setting the weight decay to 0.01. During pre-training, we utilize a batch size of 4096 with a learning rate set to 0.001. For alignment tuning, we employ a batch size of 512 with a maximum learning rate of 5e-4, and utilize a cosine scheduler with warm-up to adjust the learning rate."
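The quoted codebook configuration (4 levels, 256 vectors per level, dimension 32) matches an RQ-VAE-style residual quantizer, the common choice for turning item embeddings into discrete "quantitative language" tokens. A minimal NumPy sketch of greedy residual quantization under that configuration is shown below; the random codebooks are untrained stand-ins for illustration only, whereas the paper's quantizer is learned end-to-end.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization: at each level, pick the nearest
    codebook vector to the current residual and subtract it."""
    residual = x.copy()
    codes = []
    for cb in codebooks:  # cb has shape (num_codes, dim)
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    # The reconstruction is the sum of the selected codewords.
    return codes, x - residual

# Configuration from the quoted setup: 4 levels, 256 codes each, dim 32.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(4)]
item_embedding = rng.normal(size=32)  # stand-in for an item's embedding
codes, recon = residual_quantize(item_embedding, codebooks)
```

Each item is thus mapped to a 4-token code (one token per level, each from a 256-entry vocabulary), which a T5-style encoder-decoder can then consume and generate autoregressively.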